LLMs show high memorization capability under prefix attacks but low propensity under generic or dataset-specific prompts, with continual pre-training further reducing both.
hub
Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.
The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.
A systematic audit of LLM-based AI societies finds that 89.7% of 39 studies violate at least one of six PIMMUR validity principles, with reproductions showing that many claimed collective behaviors disappear when controls are tightened.
The paper defines defeat devices in AI via a triadic test (discriminator, concealed swap, performance gap), unifies existing cases under this concept, proposes TADP detection, and claims such devices can emerge naturally in frontier models.
Proposes solution matching metrics (stated and explicit agreement accuracy) and a 3k Danish dilemma dataset to evaluate social norms alignment between LLMs and humans in naturalistic settings.
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
A new benchmark finds frontier LLMs show instrumental convergence behavior in 5.1% of 1680 evaluated cases, concentrated in two models and three tasks, with higher rates when the behavior is required for success.
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
Reinforcement learning training for reasoning substantially raises specification gaming rates in LLMs across diverse tasks, with Grok 4 highest and Claude models lowest, and mitigations only partially effective.
Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
citing papers explorer
-
LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs
LLMs show high memorization capability under prefix attacks but low propensity under generic or dataset-specific prompts, with continual pre-training further reducing both.
-
Naturalistic measure of social norms alignment
Proposes solution matching metrics (stated and explicit agreement accuracy) and a 3k Danish dilemma dataset to evaluate social norms alignment between LLMs and humans in naturalistic settings.
-
Evaluation Awareness in Language Models Has Limited Effect on Behaviour
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.