ROSE Doesn’t Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao · 2024 · arXiv 2402.11889

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

read on arXiv browse 2 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs

cs.CR · 2026-06-21 · unverdicted · novelty 6.0

Contrastive Logit Steering isolates a linear refusal direction in safety-aligned LLMs, achieving higher jailbreak success than activation steering and enabling bidirectional control without retraining.

From System 1 to System 2: A Survey of Reasoning Large Language Models

cs.AI · 2025-02-24 · accept · novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

citing papers explorer

Showing 2 of 2 citing papers.

The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs cs.CR · 2026-06-21 · unverdicted · none · ref 36
Contrastive Logit Steering isolates a linear refusal direction in safety-aligned LLMs, achieving higher jailbreak success than activation steering and enabling bidirectional control without retraining.
From System 1 to System 2: A Survey of Reasoning Large Language Models cs.AI · 2025-02-24 · accept · none · ref 214
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

ROSE Doesn’t Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer