arXiv
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.AI (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
citing papers explorer
- Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
  LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.
- Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models
  Defines Entropy-Gradient Inversion as a negative entropy-gradient correlation fingerprinting LRM reasoning and proposes CorR-PO to embed it in RL regularization, claiming consistent outperformance on benchmarks.