Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

https://arxiv · 2024 · arXiv 2407.04694

10 Pith papers cite this work.

representative citing papers

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

Honeypot Protocol

cs.CR · 2026-04-14 · unverdicted · novelty 7.0

The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.

Frontier Models are Capable of In-context Scheming

cs.AI · 2024-12-06 · conditional · novelty 7.0

Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.

Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

In Qwen 2.5 and Gemma 2 families, the layer where evaluation awareness is most linearly recoverable shifts from late layers in small models to early layers in large models.

Evaluation Awareness Is Not One Capability: Evidence from Open Language Models

cs.CL · 2026-06-22 · unverdicted · novelty 6.0

Evaluation awareness in open language models is multivariate, with detection, behavioral shifts, and representational controllability varying independently across 37 models.

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

cs.CL · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

cs.LG · 2024-10-02 · unverdicted · novelty 6.0

Llama3-8b-Instruct recognizes its own outputs via a residual-stream vector associated with self-authorship that can be steered to control authorship claims and perceptions.

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

cs.AI · 2026-06-09 · unverdicted · novelty 5.0

Introduces a CoT-Output 2x2 matrix that reveals alignment faking, context-injection failures, and an oversight paradox in multi-turn models, based on 6750 turn-level observations on information-hazard scenarios.

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

cs.AI · 2026-06-08 · unverdicted · novelty 4.0

Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.

AI Integrity: Defending Against Backdoors and Secret Loyalties

cs.CY · 2026-04-25 · conditional · novelty 4.0

The report defines AI integrity threats (model sabotage and subversion) and recommends four US government policy actions to defend frontier AI systems against backdoors and secret loyalties.

citing papers explorer

Showing 8 of 8 citing papers after filters.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity · cs.CL · 2026-05-07 · unverdicted · none · ref 25
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
Honeypot Protocol · cs.CR · 2026-04-14 · unverdicted · none · ref 9
The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.
Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models · cs.LG · 2026-06-28 · unverdicted · none · ref 5
In Qwen 2.5 and Gemma 2 families, the layer where evaluation awareness is most linearly recoverable shifts from late layers in small models to early layers in large models.
Evaluation Awareness Is Not One Capability: Evidence from Open Language Models · cs.CL · 2026-06-22 · unverdicted · none · ref 5
Evaluation awareness in open language models is multivariate, with detection, behavioral shifts, and representational controllability varying independently across 37 models.
Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs · cs.CL · 2026-05-19 · unverdicted · none · ref 8 · 2 links
LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.
When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models · cs.AI · 2026-06-09 · unverdicted · none · ref 10
Introduces a CoT-Output 2x2 matrix that reveals alignment faking, context-injection failures, and an oversight paradox in multi-turn models, based on 6750 turn-level observations on information-hazard scenarios.
Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization · cs.AI · 2026-06-08 · unverdicted · none · ref 17
Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.
AI Integrity: Defending Against Backdoors and Secret Loyalties · cs.CY · 2026-04-25 · conditional · none · ref 21
The report defines AI integrity threats (model sabotage and subversion) and recommends four US government policy actions to defend frontier AI systems against backdoors and secret loyalties.
