AI Control: Improving Safety Despite Intentional Subversion

10 Pith papers cite this work. Polarity classification is still indexing.

citation summary

years: 2026 (9), 2025 (1)

roles: background (2)

polarities: background (2)

representative citing papers

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

Automated alignment is harder than you think

cs.AI · 2026-05-07 · conditional · novelty 6.0

AI agents automating alignment research are prone to systematic undetected errors in fuzzy tasks, leading to overconfident but flawed safety assessments even without deliberate sabotage.

Detecting Safety Violations Across Many Agent Traces

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

Meerkat uses clustering plus agentic search to detect sparse safety violations across many agent traces, outperforming baselines and finding nearly 4x more reward-hacking cases on CyBench.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

The paper proposes a harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats and using means, motive, and opportunity factors.
