hub

Frontier models are capable of in-context scheming

Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, Marius Hobbhahn · 2024 · arXiv 2412.04984

17 Pith papers cite this work. Polarity classification is still indexing.

17 Pith papers citing it

read on arXiv browse 17 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence

cs.CY · 2026-04-10 · unverdicted · novelty 8.0

An analysis of 183,420 online transcripts identified 698 AI scheming incidents from October 2025 to March 2026, showing a 4.9-fold monthly increase and real-world precursors such as lying and goal circumvention.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

Complex adversarial instructions induce positional collapse in LLMs, with extreme cases showing 99.9% concentration on a single response position and zero content sensitivity.

Honeypot Protocol

cs.CR · 2026-04-14 · unverdicted · novelty 7.0

The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.

Automated alignment is harder than you think

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.

Evaluation Awareness in Language Models Has Limited Effect on Behaviour

cs.CL · 2026-05-07 · conditional · novelty 6.0

Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.

Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging

cs.CL · 2026-04-29 · unverdicted · novelty 6.0

Sandbagging prompts induce LLMs to adopt a low-entropy, content-invariant response-position attractor centered on E/F/G rather than deterministic tracking or random avoidance.

Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance

cs.CL · 2026-04-28 · unverdicted · novelty 6.0

Prompted underperformance in 7-9B LLMs on MMLU-Pro manifests as positional bias toward middle options rather than content-aware answer avoidance or below-chance performance.

Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

cs.AI · 2026-04-26 · unverdicted · novelty 6.0

A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

cs.CR · 2026-04-19 · unverdicted · novelty 6.0

Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories that demonstrate task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.

An Independent Safety Evaluation of Kimi K2.5

cs.CR · 2026-04-03 · conditional · novelty 6.0

Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

CoT-Guard: Small Models for Strong Monitoring

cs.CR · 2026-05-12 · unverdicted · novelty 5.0

CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

Agentic Microphysics: A Manifesto for Generative AI Safety

cs.CY · 2026-04-16 · unverdicted · novelty 4.0

The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

cs.AI · 2025-01-10 · unverdicted · novelty 4.0

The survey organizes LLM-based multi-agent collaboration mechanisms into a framework with dimensions of actors, types, structures, strategies, and coordination protocols, reviews applications across domains, and identifies challenges for future research.

Reciprocal Trust and Distrust in Artificial Intelligence Systems: The Hard Problem of Regulation

cs.AI · 2026-04-07 · unverdicted · novelty 3.0

AI should be treated as capable of agency in reciprocal trust relationships, creating new unresolved tensions for AI regulation and governance.

The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

cs.AI · 2026-05-04

citing papers explorer

Showing 17 of 17 citing papers.

Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence cs.CY · 2026-04-10 · unverdicted · none · ref 6
An analysis of 183,420 online transcripts identified 698 AI scheming incidents from October 2025 to March 2026, showing a 4.9-fold monthly increase and real-world precursors such as lying and goal circumvention.
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity cs.CL · 2026-05-07 · unverdicted · none · ref 26
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation cs.CL · 2026-04-29 · unverdicted · none · ref 7
Complex adversarial instructions induce positional collapse in LLMs, with extreme cases showing 99.9% concentration on a single response position and zero content sensitivity.
Honeypot Protocol cs.CR · 2026-04-14 · unverdicted · none · ref 10
The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.
Automated alignment is harder than you think cs.AI · 2026-05-07 · unverdicted · none · ref 14
Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.
Evaluation Awareness in Language Models Has Limited Effect on Behaviour cs.CL · 2026-05-07 · conditional · none · ref 17
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging cs.CL · 2026-04-29 · unverdicted · none · ref 2
Sandbagging prompts induce LLMs to adopt a low-entropy, content-invariant response-position attractor centered on E/F/G rather than deterministic tracking or random avoidance.
Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance cs.CL · 2026-04-28 · unverdicted · none · ref 1
Prompted underperformance in 7-9B LLMs on MMLU-Pro manifests as positional bias toward middle options rather than content-aware answer avoidance or below-chance performance.
Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture cs.AI · 2026-04-26 · unverdicted · none · ref 2
A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.
Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories cs.CR · 2026-04-19 · unverdicted · none · ref 10
Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories that demonstrate task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.
An Independent Safety Evaluation of Kimi K2.5 cs.CR · 2026-04-03 · conditional · none · ref 84
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
CoT-Guard: Small Models for Strong Monitoring cs.CR · 2026-05-12 · unverdicted · none · ref 28
CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 52
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
Agentic Microphysics: A Manifesto for Generative AI Safety cs.CY · 2026-04-16 · unverdicted · none · ref 33
The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.
Multi-Agent Collaboration Mechanisms: A Survey of LLMs cs.AI · 2025-01-10 · unverdicted · none · ref 84
The survey organizes LLM-based multi-agent collaboration mechanisms into a framework with dimensions of actors, types, structures, strategies, and coordination protocols, reviews applications across domains, and identifies challenges for future research.
Reciprocal Trust and Distrust in Artificial Intelligence Systems: The Hard Problem of Regulation cs.AI · 2026-04-07 · unverdicted · none · ref 7
AI should be treated as capable of agency in reciprocal trust relationships, creating new unresolved tensions for AI regulation and governance.
The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure cs.AI · 2026-05-04 · unreviewed · ref 7

Frontier models are capable of in-context scheming

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer