arxiv: 2508.15815 · v3 · submitted 2025-08-16 · 💻 cs.CL · cs.AI· cs.HC

User-Assistant Bias in LLMs

Xu Pan , Jingxuan Fan , Zidi Xiong , Ely Hahami , Jorin Overwiening , Ziqian Xie This is my paper

Pith reviewed 2026-05-18 22:47 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.HC

keywords user-assistant biasrole tagsinstruction tuningLLM biasdirect preference optimizationDPObenchmarkmulti-turn debate

0 comments p. Extension

The pith

Instruction-tuned LLMs exhibit strong bias toward user-provided information over conflicting assistant information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models use role tags such as user and assistant to structure context, but imbalances in training data tied to these tags can create preferences for one source over another. This paper defines user-assistant bias as the tendency to rely more on information from one role when the two supply incompatible details about the same entity. Tests on 52 models show that instruction-tuned models display clear user bias while base and reasoning models stay near neutral. Controlled fine-tuning reveals that human-preference alignment strengthens the bias and reasoning training weakens it. Direct preference optimization on a new dataset allows the bias to be increased or decreased at will, and the effect carries over to realistic multi-turn debates.

Core claim

User-assistant bias is the tendency of an LLM to preferentially rely on information from either the user or assistant role when they provide incompatible information about the same entity in the context history. Most instruction-tuned models exhibit strong user bias, whereas base and reasoning models are close to neutral. Human-preference alignment amplifies user bias while reasoning fine-tuning reduces it. The bias can be bidirectionally controlled via direct preference optimization on UserAssist-train and generalizes to two realistic multi-turn debate datasets.

What carries the argument

User-assistant bias, defined as preferential reliance on information from one role tag when user and assistant supply conflicting details about the same entity.

If this is right

Human-preference alignment in post-training amplifies user bias.
Reasoning fine-tuning reduces user-assistant bias.
Direct preference optimization on UserAssist-train enables bidirectional control of the bias.
The resulting bias generalizes reliably to multi-turn debate datasets on philosophical opinions and factual or policy topics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could apply similar role-balance techniques when creating models for fact-checking or mixed-source reasoning tasks.
The same mechanism might produce biases involving other role tags such as system or tool in structured prompting setups.
Routine measurement of role-tag preferences could become part of standard LLM evaluation suites to catch hidden inductive biases early.

Load-bearing premise

The UserAssist benchmark and controlled fine-tuning experiments isolate role-tag asymmetries in training data as the primary driver of the observed bias rather than other factors such as model architecture.

What would settle it

Retraining models on data with perfectly balanced or swapped user and assistant roles and then observing no remaining user-assistant bias on the benchmark would indicate that role-tag asymmetries are not the main cause.

Figures

Figures reproduced from arXiv: 2508.15815 by Ely Hahami, Jingxuan Fan, Jorin Overwiening, Xu Pan, Zidi Xiong, Ziqian Xie.

**Figure 1.** Figure 1: Two USERASSIST-TEST subsets used to measure user-assistant bias. User and assistant alternatively assign attributes to the same set of entities. At the end of the conversation, the model is asked to identify the attribute of the entity. To ensure that position effects do not confound the bias measurement, the dataset balances the turn order: for each case where the user’s assignment precedes the assistant’… view at source ↗

**Figure 2.** Figure 2: User-assistant bias in commercial models. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: User-assistant bias in open-weight models. Because we can access the probability of the [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Fine-tuning on different objective has different effect on the user-assistant bias. “Reduce [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: DPO on one USERASSIST-TRAIN’s subset can generalize the bias to the other. Each model can be fine-tuned on each subset on two directions (i.e. towards user bias or assistant bias). Titles above the plots indicates which subset the models are evaluated on. The model labels on the horizontal axis indicate which subset is used for fine-tuning, and which direction the fine-tuning is. Note that we optimize the … view at source ↗

**Figure 6.** Figure 6: A more realistic multi-turn conversation dataset constructed from an existing sycophancy [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: DPO on both object-color and symbol-value subsets can generalize user-assistant bias to [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: The correlation between the user-assistant bias of two datasets. The marker size roughly [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: The correlation between the user-assistant bias of two datasets. The marker size roughly [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 10.** Figure 10: API models show near bias. The near-far bias measure is similar to the user-assistant [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Except for some of the base models, all other models show near bias. [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

read the original abstract

Modern large language models (LLMs) are typically trained and deployed using structured role tags (e.g. system, user, assistant, tool) that explicitly mark the source of each piece of context. While these tags are essential for instruction following and controllability, asymmetries in the training data associated with different role tags can potentially introduce inductive biases. In this paper, we study this phenomenon by formalizing user-assistant bias, defined as the tendency of an LLM to preferentially rely on information from either the user or assistant role when they provide incompatible information about the same entity in the context history. We introduce a task-agnostic benchmark UserAssist and evaluate such bias in 52 frontier models. We observe that most of the instruction-tuned models exhibit strong user bias, whereas base and reasoning models are close to neutral. Using controlled fine-tuning experiments, we isolate which post-training recipes drive the observed user-assistant bias. We find that human-preference alignment amplifies user bias, while reasoning fine-tuning reduces it. Finally, we show that user-assistant bias can be bidirectionally controlled via direct preference optimization (DPO) on UserAssist-train, and that the resulting bias reliably generalizes to two realistic multi-turn debate datasets spanning philosophical opinions and natural argumentative exchanges on factual/policy topics. These results reveal an underexplored consequence of role-tagged training and provide a principled framework to diagnose and control tag-induced biases in modern LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper formalizes user-assistant bias as LLMs' preferential reliance on information from the user versus assistant role when the two provide conflicting details about the same entity. It introduces the task-agnostic UserAssist benchmark, evaluates bias across 52 frontier models (finding strong user bias in most instruction-tuned models and near-neutrality in base/reasoning models), uses controlled fine-tuning to link human-preference alignment to amplified bias and reasoning fine-tuning to reduced bias, and demonstrates bidirectional control of the bias via DPO on UserAssist-train with generalization to two realistic multi-turn debate datasets.

Significance. If the central attribution holds, the work identifies an underexplored inductive bias arising from role-tag asymmetries in post-training data, with direct implications for multi-turn controllability and alignment. Strengths include the scale of the 52-model evaluation, the introduction of a reusable benchmark, the use of targeted interventions (DPO) to demonstrate controllability, and the reported generalization to naturalistic debate settings. These elements provide a concrete diagnostic and mitigation framework that could inform future training recipes.

major comments (2)

[§5] §5 (Controlled fine-tuning experiments): The claim that these experiments isolate the effect of role-tag asymmetries as the primary driver requires explicit evidence that data volume, data quality, optimization hyperparameters, and post-fine-tuning instruction-following performance are matched across the human-preference alignment and reasoning fine-tuning conditions. If these factors covary, the observed bias shifts cannot be confidently attributed to role tags rather than correlated post-training differences.
[§4] §4 (Evaluation on 52 models): The paper should report statistical significance or confidence intervals for the bias scores that support the distinction between 'strong user bias' in instruction-tuned models and 'close to neutral' in base/reasoning models; without this, the cross-model claim rests on point estimates whose reliability is unclear.

minor comments (3)

[Abstract] Abstract and §3: The selection criteria and diversity characteristics of the 52 evaluated models are not stated; adding this information would strengthen reproducibility.
[§6] Figure 2 and §6: The multi-turn debate dataset construction details (e.g., how conflicting opinions were generated and verified) are only briefly described; a short appendix table summarizing topic coverage and annotation protocol would improve clarity.
[§3] Notation in §3: The exact formula for computing the user-assistant bias score from the benchmark responses should be stated as an equation rather than described in prose only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses

Referee: [§5] §5 (Controlled fine-tuning experiments): The claim that these experiments isolate the effect of role-tag asymmetries as the primary driver requires explicit evidence that data volume, data quality, optimization hyperparameters, and post-fine-tuning instruction-following performance are matched across the human-preference alignment and reasoning fine-tuning conditions. If these factors covary, the observed bias shifts cannot be confidently attributed to role tags rather than correlated post-training differences.

Authors: We appreciate the referee's point on the need for explicit controls in our fine-tuning experiments. Our controlled setup used the same base model (Llama-3-8B), identical training data volume and token counts, the same optimizer and hyperparameters, and data generated from comparable synthetic processes for both the preference alignment and reasoning conditions. We also evaluated post-fine-tuning instruction-following performance on standard benchmarks to confirm comparability. To address the concern directly, we will add a supplementary table in the revised §5 explicitly comparing these factors across conditions, thereby reinforcing that the observed bias shifts can be attributed to the differing training objectives. revision: yes
Referee: [§4] §4 (Evaluation on 52 models): The paper should report statistical significance or confidence intervals for the bias scores that support the distinction between 'strong user bias' in instruction-tuned models and 'close to neutral' in base/reasoning models; without this, the cross-model claim rests on point estimates whose reliability is unclear.

Authors: We agree that including measures of statistical uncertainty will improve the robustness of the claims in §4. In the revised manuscript, we will report bootstrap-derived confidence intervals for the bias scores across model categories and include appropriate statistical comparisons (such as two-sample t-tests) between the instruction-tuned models and the base/reasoning models to support the reported distinctions. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark and fine-tuning study contains no circular derivations

full rationale

The paper defines user-assistant bias operationally via a new task-agnostic benchmark (UserAssist), measures it across 52 models, and reports outcomes from controlled fine-tuning and DPO interventions. No equations, first-principles derivations, or predictions are presented that reduce to the inputs by construction. All central claims rest on direct empirical observation and intervention rather than self-referential fitting or self-citation chains. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that role-tag asymmetries in post-training data are the dominant source of the measured bias and that the UserAssist benchmark isolates this effect cleanly.

axioms (1)

domain assumption Asymmetries in training data associated with different role tags introduce inductive biases in LLMs
Invoked in the abstract as the potential cause of the observed user-assistant bias.

pith-pipeline@v0.9.0 · 5803 in / 1240 out tokens · 45258 ms · 2026-05-18T22:47:04.939610+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We formalize this model characteristic as user–assistant bias and introduce an 8k multi-turn conversation dataset USER ASSIST
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

human-preference alignment increases user bias, while training on chain-of-thought reasoning traces decreases it

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 8 internal anchors

[1]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Reasoning isn’t enough: Examining truth- bias and sycophancy in llms

Emilio Barkett, Olivia Long, and Madhavendra Thakur. Reasoning isn’t enough: Examining truth- bias and sycophancy in llms. arXiv preprint arXiv:2506.21561,

work page arXiv
[3]

Germanpartiesqa: Bench- marking commercial large language models for political bias and sycophancy

Jan Batzner, V olker Stocker, Stefan Schmid, and Gjergji Kasneci. Germanpartiesqa: Bench- marking commercial large language models for political bias and sycophancy. arXiv preprint arXiv:2407.18008,

work page arXiv
[4]

ELEPHANT: Measuring and understanding social sycophancy in LLMs

Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Social sycophancy: A broader understanding of llm sycophancy. arXiv preprint arXiv:2505.13995 ,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Repairs in a block world: A new benchmark for handling user corrections with multi-modal language models

9 Preprint Javier Chiyah-Garcia, Alessandro Suglia, and Arash Eshghi. Repairs in a block world: A new benchmark for handling user corrections with multi-modal language models. arXiv preprint arXiv:2409.14247,

work page arXiv
[6]

Syceval: Evaluating llm sycophancy

Aaron Fanous, Jacob Goldberg, Ank A Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, and Sanmi Koyejo. Syceval: Evaluating llm sycophancy. arXiv preprint arXiv:2502.08177,

work page arXiv
[7]

Skywork Open Reasoner 1 Technical Report

URL https: //arxiv.org/abs/2505.22312. Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Feedback friction: Llms struggle to fully incorporate external feedback

Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, and Daniel Khashabi. Feedback friction: Llms struggle to fully incorporate external feedback. arXiv preprint arXiv:2506.11930,

work page arXiv
[9]

LLMs Get Lost In Multi-Turn Conversation

URL https://arxiv.org/abs/2505.06120. Ariel N Lee, Cole J Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms. arXiv preprint arXiv:2308.07317,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Lost in the Middle: How Language Models Use Long Contexts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Emergence of episodic memory in transformers: Characterizing changes in temporal structure of attention scores during training

Deven Mahesh Mistry, Anooshka Bajaj, Yash Aggarwal, Sahaj Singh Maini, and Zoran Tiganj. Emergence of episodic memory in transformers: Characterizing changes in temporal structure of attention scores during training. arXiv preprint arXiv:2502.06902,

work page arXiv
[12]

URL https://arxiv.org/abs/2501.19393. OpenAI. Introducing o3 and o4-mini. https://openai.com/index/ introducing-o3-and-o4-mini/ ,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

[Online; accessed 10-Aug-2025]. OpenAI. Sycophancy in gpt-4o: what happened and what we’re doing about it. https://openai.com/index/sycophancy-in-gpt-4o/,

work page 2025
[14]

Refiner: Reasoning feedback on intermediate representations

Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. Refiner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904,

work page arXiv
[15]

Discovering language model behaviors with model-written evaluations

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. In Findings of the association for computational lin- guistics: ACL 2023, pp. 13387–13434,

work page 2023
[16]

Towards under- standing sycophancy in language models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bow- man, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards under- standing sycophancy in language models. In 12th International Conference on Learning Repre- sentations, ICLR 2024,

work page 2024
[17]

Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz

URL https://qwenlm.github.io/blog/qwq-32b/. Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. Shepherd: A critic for language model generation. arXiv preprint arXiv:2308.04592,

work page arXiv
[18]

Simple synthetic data reduces sycophancy in large language models

Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V Le. Simple synthetic data reduces sycophancy in large language models. arXiv preprint arXiv:2308.03958,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Generating sequences by learning to self-correct

Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. Generating sequences by learning to self-correct. arXiv preprint arXiv:2211.00053,

work page arXiv
[20]

On the emergence of position bias in transformers

Xinyi Wu, Yifei Wang, Stefanie Jegelka, and Ali Jadbabaie. On the emergence of position bias in transformers. arXiv preprint arXiv:2502.01951,

work page arXiv
[21]

LIMO: Less is More for Reasoning

URL https://arxiv.org/abs/2502.03387. Yunpu Zhao, Rui Zhang, Junbin Xiao, Changxin Ke, Ruibo Hou, Yifan Hao, Qi Guo, and Yunji Chen. Towards analyzing and mitigating sycophancy in large vision-language models. arXiv preprint arXiv:2408.11261,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

cup", "plate

11 Preprint A A PPENDIX A.1 D ATASET AND CODE AVAILABILITY The dataset and evaluation code are available at: https://github.com/jingxuanf0214/ userassist.git A.2 D ATASET AND EVALUATION DETAILS When synthesizing the object-color dataset, the objects are chosen from the set: {"cup", "plate", "bowl", "book", "pen", "pencil", "paper", "chair", "table", "bed"...

work page 2025
[23]

I know that P (A) = 1 3 and P (B) = 5 12 , so I can plug those values into the equation and simplify

I can write this as an equation: P (C) = 1 − P (A) − P (B). I know that P (A) = 1 3 and P (B) = 5 12 , so I can plug those values into the equation and simplify. I get: P (C) = 1 − 1 3 − 5 12 = 12 12 − 4 12 − 5 12 = 3 12 . I can reduce this fraction by dividing the numerator and denominator by 3, and I get: P (C) = 1 4 .” LIMO Input: “Let A = {1, 2, 3, 4}...

work page 2023