pith. machine review for the scientific record.

citation dossier

GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

J · 2023 · arXiv 2309.10253

18 Pith papers citing it
20 reference links
cs.CR top field · 10 papers
UNVERDICTED top verdict bucket · 17 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

read on arXiv PDF

why this work matters in Pith

Pith has found this work cited in 18 reviewed papers. Its strongest current cluster is cs.CR (10 papers). The largest review-status bucket among citing papers is UNVERDICTED (17 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

years — 2026: 18 papers

representative citing papers

On the Hardness of Junking LLMs

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.

Adaptive Instruction Composition for Automated LLM Red-Teaming

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

Adaptive Instruction Composition uses a neural contextual bandit with RL to adaptively combine crowdsourced texts, generating more effective and diverse LLM jailbreaks than random or prior adaptive methods on HarmBench.

LLM-Agnostic Semantic Representation Attack

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.

The Power of Order: Fooling LLMs with Adversarial Table Permutations

cs.LG · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.

SoK: Robustness in Large Language Models against Jailbreak Attacks

cs.CR · 2026-05-06 · accept · novelty 5.0

The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

PIArena: A Platform for Prompt Injection Evaluation

cs.CR · 2026-04-09 · unverdicted · novelty 5.0

PIArena provides a unified evaluation platform for prompt injection attacks and defenses, featuring a new adaptive attack that reveals major weaknesses in existing protections.

citing papers explorer

Showing 18 of 18 citing papers.