Forty-second International Conference on Machine Learning , year=

The Geometry of Refusal in Large Language Models: Concept Cones, Representational Independence , author=

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

cs.CR · 2026-04-20 · unverdicted · novelty 6.0

Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.

citing papers explorer

Showing 2 of 2 citing papers.

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models cs.AI · 2026-04-30 · unverdicted · none · ref 5
LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.
Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks cs.CR · 2026-04-20 · unverdicted · none · ref 41
Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.

Forty-second International Conference on Machine Learning , year=

fields

years

verdicts

representative citing papers

citing papers explorer