Understanding the effectiveness of coverage criteria for large language models: A special angle from jailbreak attacks
2 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: 2 unverdicted.
Citing papers:
- RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
  RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
- Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics
  A framework detects LLM anomalies including hallucinations, jailbreaks, and backdoors by forensic inspection of layer-wise hidden state patterns, reporting over 95% accuracy with minimal computational overhead.