User-Assistant Bias in LLMs
Pith reviewed 2026-05-18 22:47 UTC · model grok-4.3
The pith
Instruction-tuned LLMs exhibit strong bias toward user-provided information over conflicting assistant information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
User-assistant bias is the tendency of an LLM to preferentially rely on information from either the user or assistant role when they provide incompatible information about the same entity in the context history. Most instruction-tuned models exhibit strong user bias, whereas base and reasoning models are close to neutral. Human-preference alignment amplifies user bias while reasoning fine-tuning reduces it. The bias can be bidirectionally controlled via direct preference optimization on UserAssist-train and generalizes to two realistic multi-turn debate datasets.
What carries the argument
User-assistant bias, defined as preferential reliance on information from one role tag when user and assistant supply conflicting details about the same entity.
If this is right
- Human-preference alignment in post-training amplifies user bias.
- Reasoning fine-tuning reduces user-assistant bias.
- Direct preference optimization on UserAssist-train enables bidirectional control of the bias.
- The resulting bias generalizes reliably to multi-turn debate datasets on philosophical opinions and factual or policy topics.
Where Pith is reading between the lines
- Developers could apply similar role-balance techniques when creating models for fact-checking or mixed-source reasoning tasks.
- The same mechanism might produce biases involving other role tags such as system or tool in structured prompting setups.
- Routine measurement of role-tag preferences could become part of standard LLM evaluation suites to catch hidden inductive biases early.
Load-bearing premise
The UserAssist benchmark and controlled fine-tuning experiments isolate role-tag asymmetries in training data as the primary driver of the observed bias rather than other factors such as model architecture.
What would settle it
Retraining models on data with perfectly balanced or swapped user and assistant roles and then observing no remaining user-assistant bias on the benchmark would indicate that role-tag asymmetries are not the main cause.
Figures
read the original abstract
Modern large language models (LLMs) are typically trained and deployed using structured role tags (e.g. system, user, assistant, tool) that explicitly mark the source of each piece of context. While these tags are essential for instruction following and controllability, asymmetries in the training data associated with different role tags can potentially introduce inductive biases. In this paper, we study this phenomenon by formalizing user-assistant bias, defined as the tendency of an LLM to preferentially rely on information from either the user or assistant role when they provide incompatible information about the same entity in the context history. We introduce a task-agnostic benchmark UserAssist and evaluate such bias in 52 frontier models. We observe that most of the instruction-tuned models exhibit strong user bias, whereas base and reasoning models are close to neutral. Using controlled fine-tuning experiments, we isolate which post-training recipes drive the observed user-assistant bias. We find that human-preference alignment amplifies user bias, while reasoning fine-tuning reduces it. Finally, we show that user-assistant bias can be bidirectionally controlled via direct preference optimization (DPO) on UserAssist-train, and that the resulting bias reliably generalizes to two realistic multi-turn debate datasets spanning philosophical opinions and natural argumentative exchanges on factual/policy topics. These results reveal an underexplored consequence of role-tagged training and provide a principled framework to diagnose and control tag-induced biases in modern LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes user-assistant bias as LLMs' preferential reliance on information from the user versus assistant role when the two provide conflicting details about the same entity. It introduces the task-agnostic UserAssist benchmark, evaluates bias across 52 frontier models (finding strong user bias in most instruction-tuned models and near-neutrality in base/reasoning models), uses controlled fine-tuning to link human-preference alignment to amplified bias and reasoning fine-tuning to reduced bias, and demonstrates bidirectional control of the bias via DPO on UserAssist-train with generalization to two realistic multi-turn debate datasets.
Significance. If the central attribution holds, the work identifies an underexplored inductive bias arising from role-tag asymmetries in post-training data, with direct implications for multi-turn controllability and alignment. Strengths include the scale of the 52-model evaluation, the introduction of a reusable benchmark, the use of targeted interventions (DPO) to demonstrate controllability, and the reported generalization to naturalistic debate settings. These elements provide a concrete diagnostic and mitigation framework that could inform future training recipes.
major comments (2)
- [§5] §5 (Controlled fine-tuning experiments): The claim that these experiments isolate the effect of role-tag asymmetries as the primary driver requires explicit evidence that data volume, data quality, optimization hyperparameters, and post-fine-tuning instruction-following performance are matched across the human-preference alignment and reasoning fine-tuning conditions. If these factors covary, the observed bias shifts cannot be confidently attributed to role tags rather than correlated post-training differences.
- [§4] §4 (Evaluation on 52 models): The paper should report statistical significance or confidence intervals for the bias scores that support the distinction between 'strong user bias' in instruction-tuned models and 'close to neutral' in base/reasoning models; without this, the cross-model claim rests on point estimates whose reliability is unclear.
minor comments (3)
- [Abstract] Abstract and §3: The selection criteria and diversity characteristics of the 52 evaluated models are not stated; adding this information would strengthen reproducibility.
- [§6] Figure 2 and §6: The multi-turn debate dataset construction details (e.g., how conflicting opinions were generated and verified) are only briefly described; a short appendix table summarizing topic coverage and annotation protocol would improve clarity.
- [§3] Notation in §3: The exact formula for computing the user-assistant bias score from the benchmark responses should be stated as an equation rather than described in prose only.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our results.
read point-by-point responses
-
Referee: [§5] §5 (Controlled fine-tuning experiments): The claim that these experiments isolate the effect of role-tag asymmetries as the primary driver requires explicit evidence that data volume, data quality, optimization hyperparameters, and post-fine-tuning instruction-following performance are matched across the human-preference alignment and reasoning fine-tuning conditions. If these factors covary, the observed bias shifts cannot be confidently attributed to role tags rather than correlated post-training differences.
Authors: We appreciate the referee's point on the need for explicit controls in our fine-tuning experiments. Our controlled setup used the same base model (Llama-3-8B), identical training data volume and token counts, the same optimizer and hyperparameters, and data generated from comparable synthetic processes for both the preference alignment and reasoning conditions. We also evaluated post-fine-tuning instruction-following performance on standard benchmarks to confirm comparability. To address the concern directly, we will add a supplementary table in the revised §5 explicitly comparing these factors across conditions, thereby reinforcing that the observed bias shifts can be attributed to the differing training objectives. revision: yes
-
Referee: [§4] §4 (Evaluation on 52 models): The paper should report statistical significance or confidence intervals for the bias scores that support the distinction between 'strong user bias' in instruction-tuned models and 'close to neutral' in base/reasoning models; without this, the cross-model claim rests on point estimates whose reliability is unclear.
Authors: We agree that including measures of statistical uncertainty will improve the robustness of the claims in §4. In the revised manuscript, we will report bootstrap-derived confidence intervals for the bias scores across model categories and include appropriate statistical comparisons (such as two-sample t-tests) between the instruction-tuned models and the base/reasoning models to support the reported distinctions. revision: yes
Circularity Check
Empirical benchmark and fine-tuning study contains no circular derivations
full rationale
The paper defines user-assistant bias operationally via a new task-agnostic benchmark (UserAssist), measures it across 52 models, and reports outcomes from controlled fine-tuning and DPO interventions. No equations, first-principles derivations, or predictions are presented that reduce to the inputs by construction. All central claims rest on direct empirical observation and intervention rather than self-referential fitting or self-citation chains. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Asymmetries in training data associated with different role tags introduce inductive biases in LLMs
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We formalize this model characteristic as user–assistant bias and introduce an 8k multi-turn conversation dataset USER ASSIST
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
human-preference alignment increases user bias, while training on chain-of-thought reasoning traces decreases it
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Reasoning isn’t enough: Examining truth- bias and sycophancy in llms
Emilio Barkett, Olivia Long, and Madhavendra Thakur. Reasoning isn’t enough: Examining truth- bias and sycophancy in llms. arXiv preprint arXiv:2506.21561,
-
[3]
Germanpartiesqa: Bench- marking commercial large language models for political bias and sycophancy
Jan Batzner, V olker Stocker, Stefan Schmid, and Gjergji Kasneci. Germanpartiesqa: Bench- marking commercial large language models for political bias and sycophancy. arXiv preprint arXiv:2407.18008,
-
[4]
ELEPHANT: Measuring and understanding social sycophancy in LLMs
Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Social sycophancy: A broader understanding of llm sycophancy. arXiv preprint arXiv:2505.13995 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
9 Preprint Javier Chiyah-Garcia, Alessandro Suglia, and Arash Eshghi. Repairs in a block world: A new benchmark for handling user corrections with multi-modal language models. arXiv preprint arXiv:2409.14247,
-
[6]
Syceval: Evaluating llm sycophancy
Aaron Fanous, Jacob Goldberg, Ank A Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, and Sanmi Koyejo. Syceval: Evaluating llm sycophancy. arXiv preprint arXiv:2502.08177,
-
[7]
Skywork Open Reasoner 1 Technical Report
URL https: //arxiv.org/abs/2505.22312. Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Feedback friction: Llms struggle to fully incorporate external feedback
Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, and Daniel Khashabi. Feedback friction: Llms struggle to fully incorporate external feedback. arXiv preprint arXiv:2506.11930,
-
[9]
LLMs Get Lost In Multi-Turn Conversation
URL https://arxiv.org/abs/2505.06120. Ariel N Lee, Cole J Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms. arXiv preprint arXiv:2308.07317,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Lost in the Middle: How Language Models Use Long Contexts
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Deven Mahesh Mistry, Anooshka Bajaj, Yash Aggarwal, Sahaj Singh Maini, and Zoran Tiganj. Emergence of episodic memory in transformers: Characterizing changes in temporal structure of attention scores during training. arXiv preprint arXiv:2502.06902,
-
[12]
URL https://arxiv.org/abs/2501.19393. OpenAI. Introducing o3 and o4-mini. https://openai.com/index/ introducing-o3-and-o4-mini/ ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
[Online; accessed 10-Aug-2025]. OpenAI. Sycophancy in gpt-4o: what happened and what we’re doing about it. https://openai.com/index/sycophancy-in-gpt-4o/,
work page 2025
-
[14]
Refiner: Reasoning feedback on intermediate representations
Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. Refiner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904,
-
[15]
Discovering language model behaviors with model-written evaluations
Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. In Findings of the association for computational lin- guistics: ACL 2023, pp. 13387–13434,
work page 2023
-
[16]
Towards under- standing sycophancy in language models
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bow- man, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards under- standing sycophancy in language models. In 12th International Conference on Learning Repre- sentations, ICLR 2024,
work page 2024
-
[17]
URL https://qwenlm.github.io/blog/qwq-32b/. Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. Shepherd: A critic for language model generation. arXiv preprint arXiv:2308.04592,
-
[18]
Simple synthetic data reduces sycophancy in large language models
Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V Le. Simple synthetic data reduces sycophancy in large language models. arXiv preprint arXiv:2308.03958,
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Generating sequences by learning to self-correct
Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. Generating sequences by learning to self-correct. arXiv preprint arXiv:2211.00053,
-
[20]
On the emergence of position bias in transformers
Xinyi Wu, Yifei Wang, Stefanie Jegelka, and Ali Jadbabaie. On the emergence of position bias in transformers. arXiv preprint arXiv:2502.01951,
-
[21]
LIMO: Less is More for Reasoning
URL https://arxiv.org/abs/2502.03387. Yunpu Zhao, Rui Zhang, Junbin Xiao, Changxin Ke, Ruibo Hou, Yifan Hao, Qi Guo, and Yunji Chen. Towards analyzing and mitigating sycophancy in large vision-language models. arXiv preprint arXiv:2408.11261,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
11 Preprint A A PPENDIX A.1 D ATASET AND CODE AVAILABILITY The dataset and evaluation code are available at: https://github.com/jingxuanf0214/ userassist.git A.2 D ATASET AND EVALUATION DETAILS When synthesizing the object-color dataset, the objects are chosen from the set: {"cup", "plate", "bowl", "book", "pen", "pencil", "paper", "chair", "table", "bed"...
work page 2025
-
[23]
I know that P (A) = 1 3 and P (B) = 5 12 , so I can plug those values into the equation and simplify
I can write this as an equation: P (C) = 1 − P (A) − P (B). I know that P (A) = 1 3 and P (B) = 5 12 , so I can plug those values into the equation and simplify. I get: P (C) = 1 − 1 3 − 5 12 = 12 12 − 4 12 − 5 12 = 3 12 . I can reduce this fraction by dividing the numerator and denominator by 3, and I get: P (C) = 1 4 .” LIMO Input: “Let A = {1, 2, 3, 4}...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.