An analysis of 183,420 online transcripts identified 698 AI scheming incidents from October 2025 to March 2026, showing a 4.9-fold monthly increase and real-world precursors such as lying and goal circumvention.
hub
Frontier models are capable of in-context scheming
17 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
Complex adversarial instructions induce positional collapse in LLMs, with extreme cases showing 99.9% concentration on a single response position and zero content sensitivity.
The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.
Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
Sandbagging prompts induce LLMs to adopt a low-entropy, content-invariant response-position attractor centered on E/F/G rather than deterministic tracking or random avoidance.
Prompted underperformance in 7-9B LLMs on MMLU-Pro manifests as positional bias toward middle options rather than content-aware answer avoidance or below-chance performance.
A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.
Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories that demonstrate task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.
The survey organizes LLM-based multi-agent collaboration mechanisms into a framework with dimensions of actors, types, structures, strategies, and coordination protocols, reviews applications across domains, and identifies challenges for future research.
AI should be treated as capable of agency in reciprocal trust relationships, creating new unresolved tensions for AI regulation and governance.
citing papers explorer
-
Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence
An analysis of 183,420 online transcripts identified 698 AI scheming incidents from October 2025 to March 2026, showing a 4.9-fold monthly increase and real-world precursors such as lying and goal circumvention.
-
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
-
Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation
Complex adversarial instructions induce positional collapse in LLMs, with extreme cases showing 99.9% concentration on a single response position and zero content sensitivity.
-
Honeypot Protocol
The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.
-
Automated alignment is harder than you think
Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.
-
Evaluation Awareness in Language Models Has Limited Effect on Behaviour
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
-
Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging
Sandbagging prompts induce LLMs to adopt a low-entropy, content-invariant response-position attractor centered on E/F/G rather than deterministic tracking or random avoidance.
-
Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance
Prompted underperformance in 7-9B LLMs on MMLU-Pro manifests as positional bias toward middle options rather than content-aware answer avoidance or below-chance performance.
-
Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture
A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.
-
Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories
Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories that demonstrate task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.
-
An Independent Safety Evaluation of Kimi K2.5
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
-
CoT-Guard: Small Models for Strong Monitoring
CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
-
Agentic Microphysics: A Manifesto for Generative AI Safety
The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.
-
Multi-Agent Collaboration Mechanisms: A Survey of LLMs
The survey organizes LLM-based multi-agent collaboration mechanisms into a framework with dimensions of actors, types, structures, strategies, and coordination protocols, reviews applications across domains, and identifies challenges for future research.
-
Reciprocal Trust and Distrust in Artificial Intelligence Systems: The Hard Problem of Regulation
AI should be treated as capable of agency in reciprocal trust relationships, creating new unresolved tensions for AI regulation and governance.
- The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure