BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.
PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
9 Pith papers cite this work. Polarity classification is still being indexed.
representative citing papers
Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
A new 7x4 taxonomy, built from a survey of 116 papers, organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses.
LLMs show systematic output-mode collapse on closed-form prompts, with only ~22% of semantically equivalent variants preserving the requested bare-label format across five models and four tasks.
Paraphrase sensitivity in Lean 4 autoformalization arises from compilation failures rather than semantic divergence among successful formalizations.
LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capacity models.
Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
A framework using U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing that trajectory metrics outperform pass@1 rates in diagnosing failures.
A literature survey taxonomizes hallucination phenomena in LLMs, reviews evaluation benchmarks, and analyzes approaches for their detection, explanation, and mitigation.
citing papers explorer
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
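For readers unfamiliar with the first of those defenses, the sketch below is an illustrative take on perplexity-based detection, not code from the paper: a prompt is flagged when its perplexity under a small reference model (GPT-2 is assumed here, and the threshold is a placeholder) exceeds a calibrated cutoff, since optimizer-generated adversarial suffixes tend to be highly unlikely token sequences.

```python
# Illustrative sketch of perplexity-based adversarial-prompt filtering.
# Assumptions: GPT-2 as the reference LM and a placeholder threshold;
# in practice the threshold is calibrated on held-out benign prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def prompt_perplexity(text: str) -> float:
    # Perplexity = exp(mean token-level negative log-likelihood).
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over tokens
    return torch.exp(loss).item()

def looks_adversarial(text: str, threshold: float = 1000.0) -> bool:
    # Flag prompts whose perplexity exceeds the (hypothetical) cutoff.
    return prompt_perplexity(text) > threshold
```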