ArXiv

Training a Helpful, Harmless Assistant with Reinforcement Learning from Human Feedback

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

cs.AI · 2026-05-18 · unverdicted · novelty 5.0

Defines Entropy-Gradient Inversion as a negative entropy-gradient correlation fingerprinting LRM reasoning and proposes CorR-PO to embed it in RL regularization, claiming consistent outperformance on benchmarks.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
cs.AI · 2026-04-30 · unverdicted · none · ref 33
Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models
cs.AI · 2026-05-18 · unverdicted · none · ref 44
