ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.
Bergeron: Combating adversarial attacks through a conscience-based alignment framework
4 Pith papers cite this work. Polarity classification is still indexing.
4
Pith papers citing it
citation-role summary
background 2
citation-polarity summary
verdicts
UNVERDICTED 4roles
background 2polarities
background 2representative citing papers
Cognitive Firewall applies four gates (intent, zero-trust context, consistency, output risk) via an oversight model to cut jailbreak success to 2% or below on most tested sets while keeping over-refusal at 8%.
Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
citing papers explorer
No citing papers match the current filters.