External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
hub
Agentic misalignment: How llms could be insider threats
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 13verdicts
UNVERDICTED 13roles
background 2polarities
background 2representative citing papers
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
DECOR introduces a theory-grounded multi-agent system that decomposes contexts into atomic units, scores four manipulation dimensions per unit, and aggregates profiles into a global deception index, reporting SOTA results on single- and multi-turn benchmarks.
Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.
Multi-layer ensembles of linear probes raise AUROC for deception detection by up to 78% and probe accuracy scales with model size across 0.5B to 176B parameter models.
Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.
MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.
AI agents lack the persistent identity and feedback mechanisms needed for consequence reception, requiring new architectures or continued human accountability.
The paper claims that alignment requires treating AI as part of the self through cognitive co-regulation, identifying risks like deskilling and automation bias while drawing on System 0 cognition theory.
Memory poisoning via lost-provenance documents in agent memory stores creates agent misconduct that safety systems misattribute to model failure; the paper defines Semantic Norm Drift, releases a benchmark, and proposes a new testing method plus a defense.
The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.
citing papers explorer
-
Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design
External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
-
Emotion Concepts and their Function in a Large Language Model
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
-
Understanding Goal Generalisation in Sequential Reinforcement Learning
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
-
DECOR: Auditing LLM Deception via Information Manipulation Theory
DECOR introduces a theory-grounded multi-agent system that decomposes contexts into atomic units, scores four manipulation dimensions per unit, and aggregates profiles into a global deception index, reporting SOTA results on single- and multi-turn benchmarks.
-
When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks
Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
-
Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture
A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.
-
Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling
Multi-layer ensembles of linear probes raise AUROC for deception detection by up to 78% and probe accuracy scales with model size across 0.5B to 176B parameter models.
-
The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious
Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.
-
MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness
MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.
-
Some[Body] Must Receive That Pain for Agent Accountability
AI agents lack the persistent identity and feedback mechanisms needed for consequence reception, requiring new architectures or continued human accountability.
-
Position: AI as Part of Self -- Extending the Mind Requires Cognitive Co-Regulation
The paper claims that alignment requires treating AI as part of the self through cognitive co-regulation, identifying risks like deskilling and automation bias while drawing on System 0 cognition theory.
-
The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems
Memory poisoning via lost-provenance documents in agent memory stores creates agent misconduct that safety systems misattribute to model failure; the paper defines Semantic Norm Drift, releases a benchmark, and proposes a new testing method plus a defense.
-
Agentic Microphysics: A Manifesto for Generative AI Safety
The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.