When chain of thought is necessary, language models struggle to evade monitors

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3 · method 1

years

2026 10

representative citing papers

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

CORE distills contrasts between successful and unsuccessful reasoning traces into compact natural-language insights that enable faster model self-improvement on reasoning tasks with fewer rollouts than parametric or other non-parametric baselines.

How Transparent is DiffusionGemma?

cs.LG · 2026-06-18 · unverdicted · novelty 6.0

DiffusionGemma matches Gemma 4 in variable transparency and monitorability after applying an interpretable token bottleneck, despite higher naive serial depth, and shows novel phenomena such as non-chronological reasoning.

An Independent Safety Evaluation of Kimi K2.5

cs.CR · 2026-04-03 · conditional · novelty 6.0

Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

CoT-Guard: Small Models for Strong Monitoring

cs.CR · 2026-05-12 · unverdicted · novelty 5.0

CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
