Refusal in language models is mediated by a single direction

5 Pith papers cite this work. Polarity classification is still indexing.

Citing papers by year: 2026 (5)

representative citing papers

Do Thinking Tokens Help with Safety?

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

Thinking tokens in reasoning models do not enable safety deliberation; refusal/compliance is strongly predictable from the first token and rarely changes during thinking.

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

cs.CR · 2026-06-18 · unverdicted · novelty 6.0

SafeSpec integrates a latent safety head into speculative LLM decoding with rollback and reflective multi-sampling, cutting attack success rates by 15% on Qwen3-32B while retaining a 2.06x speedup on normal workloads.