arXiv preprint arXiv:2311.03348 (2023)

15 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 3 · method: 1

citation-polarity summary

background: 4

representative citing papers

On the Hardness of Junking LLMs

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.

GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

cs.CV · 2026-01-06 · unverdicted · novelty 7.0

GAMBIT constructs gamified instructional traps that decompose harmful visuals and drive MLLMs to reconstruct and answer malicious queries as part of winning a game, achieving over 85% attack success on models including GPT-4o and Gemini 2.5 Flash.

A StrongREJECT for Empty Jailbreaks

cs.LG · 2024-02-15 · conditional · novelty 6.0

StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.

Dr. Jekyll and Mr. Hyde: Two Faces of LLMs

cs.CR · 2023-12-06 · unverdicted · novelty 3.0

Impersonating complex misaligned personas via biographies and role-play bypasses safety in ChatGPT, Gemini, and DeepSeek, succeeding on 38 to 40 of 40 illicit questions across the tested models.
