pith. machine review for the scientific record.


Risks from learned optimization in advanced machine learning systems

19 Pith papers cite this work. Polarity classification is still indexing.


Citing papers by year: 2026 (18), 2022 (1)

Verdicts: unverdicted (19)

representative citing papers

Mechanistic Anomaly Detection via Functional Attribution

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

Functional attribution with influence functions detects anomalous mechanisms in neural networks, achieving SOTA backdoor detection (average DER 0.93) on vision benchmarks and improvements on LLMs.

Safety, Security, and Cognitive Risks in World Models

cs.CR · 2026-04-01 · unverdicted · novelty 6.0

World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and DreamerV3.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
