Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

URL: http://arxiv · 2024 · arXiv 2407.04694

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

Honeypot Protocol

cs.CR · 2026-04-14 · unverdicted · novelty 7.0

The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.

Frontier Models are Capable of In-context Scheming

cs.AI · 2024-12-06 · conditional · novelty 7.0

Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.

Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

In Qwen 2.5 and Gemma 2 families, the layer where evaluation awareness is most linearly recoverable shifts from late layers in small models to early layers in large models.

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

cs.CL · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

cs.LG · 2024-10-02 · unverdicted · novelty 6.0

Llama3-8b-Instruct recognizes its own outputs via a residual-stream vector associated with self-authorship that can be steered to control authorship claims and perceptions.

citing papers explorer

Showing 6 of 6 citing papers.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
cs.CL · 2026-05-07 · unverdicted · none · ref 25

Honeypot Protocol
cs.CR · 2026-04-14 · unverdicted · none · ref 9

Frontier Models are Capable of In-context Scheming
cs.AI · 2024-12-06 · conditional · none · ref 21

Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models
cs.LG · 2026-06-28 · unverdicted · none · ref 5

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs
cs.CL · 2026-05-19 · unverdicted · none · ref 8 · 2 links

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
cs.LG · 2024-10-02 · unverdicted · none · ref 10
