A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
When ``A Helpful Assistant'' Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models
5 Pith papers cite this work. Polarity classification is still indexing.
years
2026 5verdicts
UNVERDICTED 5representative citing papers
Persona-driven generations by LLMs in MCQA tasks exhibit instability that differs systematically by model family, size, domain, and prompt format.
Incidental prompt cues induce large, systematic shifts in the algorithm families chosen by LLMs during code generation across thousands of controlled trials.
LLMs simulating eating disorder patients show negligible variability but overshoot ground-truth EDE-Q severity by 0.7-1.8 points due to selective stereotyping of cognitive-affective symptoms.
Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.
citing papers explorer
-
Can AI Agents Synthesize Scientific Conclusions?
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
-
Persona Non Grata: LLM Persona-Driven Generations in MCQA are Unstable in Distinct Dimensions
Persona-driven generations by LLMs in MCQA tasks exhibit instability that differs systematically by model family, size, domain, and prompt format.
-
The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation
Incidental prompt cues induce large, systematic shifts in the algorithm families chosen by LLMs during code generation across thousands of controlled trials.
-
Simulating Eating Disorder Patients with LLMs: Evaluating Psychological Persona Stability in Multi-Turn Conversations
LLMs simulating eating disorder patients show negligible variability but overshoot ground-truth EDE-Q severity by 0.7-1.8 points due to selective stereotyping of cognitive-affective symptoms.
-
Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs
Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.