pith. machine review for the scientific record. sign in

A holistic approach to undesired content detection

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

fields

cs.CL 2 cs.SD 1

years

2026 2 2022 1

verdicts

UNVERDICTED 3

representative citing papers

Test-Time Safety Alignment

cs.CL · 2026-04-28 · unverdicted · novelty 6.0

Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.

citing papers explorer

Showing 3 of 3 citing papers.

  • VoxSafeBench: Not Just What Is Said, but Who, How, and Where cs.SD · 2026-04-16 · unverdicted · none · ref 19

    VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

  • Test-Time Safety Alignment cs.CL · 2026-04-28 · unverdicted · none · ref 22

    Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.

  • Ignore Previous Prompt: Attack Techniques For Language Models cs.CL · 2022-11-17 · unverdicted · none · ref 13

    PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.