When "A Helpful Assistant" Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models

Zheng, Mingqian, Pei, Jiaxin, Logeswaran, Lajanugen, Lee, Moontae, Jurgens, David · 2024 · DOI 10.18653/v1/2024.findings-emnlp.888

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

Persona Non Grata: LLM Persona-Driven Generations in MCQA are Unstable in Distinct Dimensions

cs.CL · 2026-07-01 · unverdicted · novelty 6.0

Persona-driven generations by LLMs in MCQA tasks exhibit instability that differs systematically by model family, size, domain, and prompt format.

The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation

cs.SE · 2026-06-02 · unverdicted · novelty 6.0

Incidental prompt cues induce large, systematic shifts in the algorithm families chosen by LLMs during code generation across thousands of controlled trials.

Simulating Eating Disorder Patients with LLMs: Evaluating Psychological Persona Stability in Multi-Turn Conversations

cs.CY · 2026-05-12 · unverdicted · novelty 6.0

LLMs simulating eating disorder patients show negligible variability but overshoot ground-truth EDE-Q severity by 0.7-1.8 points due to selective stereotyping of cognitive-affective symptoms.

Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.

citing papers explorer

Can AI Agents Synthesize Scientific Conclusions? · cs.AI · 2026-06-09 · unverdicted · none · ref 138
Persona Non Grata: LLM Persona-Driven Generations in MCQA are Unstable in Distinct Dimensions · cs.CL · 2026-07-01 · unverdicted · none · ref 42
The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation · cs.SE · 2026-06-02 · unverdicted · none · ref 24
Simulating Eating Disorder Patients with LLMs: Evaluating Psychological Persona Stability in Multi-Turn Conversations · cs.CY · 2026-05-12 · unverdicted · none · ref 45
Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs · cs.CL · 2026-04-21 · unverdicted · none · ref 185