Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs · 2024 · arXiv 2407.04694

9 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

Honeypot Protocol

cs.CR · 2026-04-14 · unverdicted · novelty 7.0

The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.

Frontier Models are Capable of In-context Scheming

cs.AI · 2024-12-06 · conditional · novelty 7.0

Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.

Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

In Qwen 2.5 and Gemma 2 families, the layer where evaluation awareness is most linearly recoverable shifts from late layers in small models to early layers in large models.

Evaluation Awareness Is Not One Capability: Evidence from Open Language Models

cs.CL · 2026-06-22 · unverdicted · novelty 6.0

Evaluation awareness in open language models is multivariate, with detection, behavioral shifts, and representational controllability varying independently across 37 models.

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

cs.CL · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

cs.LG · 2024-10-02 · unverdicted · novelty 6.0

Llama3-8b-Instruct recognizes its own outputs via a residual-stream vector associated with self-authorship that can be steered to control authorship claims and perceptions.

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

cs.AI · 2026-06-09 · unverdicted · novelty 5.0

Introduces CoT-Output 2x2 matrix revealing alignment faking, context-injection failures, and an oversight paradox in multi-turn models via 6750 turn-level observations on information-hazard scenarios.

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

cs.AI · 2026-06-08 · unverdicted · novelty 4.0

Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.

citing papers explorer

Frontier Models are Capable of In-context Scheming

cs.AI · 2024-12-06 · conditional · none · ref 21

Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

cs.LG · 2024-10-02 · unverdicted · none · ref 10

Llama3-8b-Instruct recognizes its own outputs via a residual-stream vector associated with self-authorship that can be steered to control authorship claims and perceptions.
