A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
When "A Helpful Assistant" Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models
5 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (5)
Verdicts: UNVERDICTED (5)
Representative citing papers
Persona-driven generations by LLMs in MCQA tasks exhibit instability that differs systematically by model family, size, domain, and prompt format.
Incidental prompt cues induce large, systematic shifts in the algorithm families chosen by LLMs during code generation across thousands of controlled trials.
LLMs simulating eating disorder patients show negligible variability but overshoot ground-truth EDE-Q severity by 0.7-1.8 points due to selective stereotyping of cognitive-affective symptoms.
Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.
citing papers explorer
- The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation
Incidental prompt cues induce large, systematic shifts in the algorithm families chosen by LLMs during code generation across thousands of controlled trials.