A holistic approach to undesired content detection

Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee, Steven Adler, Angela Jiang, Lilian Weng · 2022 · arXiv 2208.03274

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Test-Time Safety Alignment

cs.CL · 2026-04-28 · unverdicted · novelty 6.0

Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.

Ignore Previous Prompt: Attack Techniques For Language Models

cs.CL · 2022-11-17 · unverdicted · novelty 6.0

PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.

citing papers explorer

VoxSafeBench: Not Just What Is Said, but Who, How, and Where
cs.SD · 2026-04-16 · unverdicted · none · ref 19

Test-Time Safety Alignment
cs.CL · 2026-04-28 · unverdicted · none · ref 22

Ignore Previous Prompt: Attack Techniques For Language Models
cs.CL · 2022-11-17 · unverdicted · none · ref 13
