
arxiv: 2506.14092 · v3 · submitted 2025-06-17 · 💻 cs.AI

Fragile Preferences: A Deep Dive Into Order Effects in Large Language Models

Pith reviewed 2026-05-19 10:01 UTC · model grok-4.3

classification 💻 cs.AI
keywords: order effects · position bias · large language models · rational choice framework · fragile preferences · name bias · decision support · mitigation strategies

The pith

Order effects cause large language models to select strictly inferior options.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines position biases in LLMs when comparing alternatives in decision tasks. Experiments cover resume comparisons in high-stakes contexts and a color selection task that removes semantic and real-world confounders to isolate order. Models exhibit a quality-dependent shift, favoring the first option at high quality but later options at lower quality, plus a name bias favoring certain names. An extended rational choice framework classifies pairwise preferences as robust, fragile, or indifferent to show that order can distort judgments enough to select worse options over better ones. Mitigation approaches such as temperature adjustment are proposed to recover undistorted preferences.

Core claim

The authors extend the rational choice framework to classify pairwise preferences as robust, fragile, or indifferent. Using this classification across multiple LLMs and domains, they demonstrate that order effects lead models to select strictly inferior options, revealing failure modes distinct from human decision-making.

What carries the argument

The extended rational choice framework that classifies pairwise preferences as robust, fragile, or indifferent to separate superficial tie-breaking from genuine order-induced distortions of judgment.
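
To make the three labels concrete, the sketch below classifies a pair of options from how often option A is chosen in each presentation order. It is an illustrative operationalization only, not the paper's formal definition: the function name `classify_pair`, the tolerance `tol`, and the decision rules are all assumptions introduced here.

```python
def classify_pair(p_a_first: float, p_a_second: float, tol: float = 0.05) -> str:
    """Label a pairwise preference from choice rates under both orders.

    p_a_first:  fraction of trials choosing A when A is listed first.
    p_a_second: fraction of trials choosing A when A is listed second.
    tol:        hypothetical tolerance around a 50/50 split.
    """
    a_wins_first = p_a_first > 0.5
    a_wins_second = p_a_second > 0.5
    clear_first = abs(p_a_first - 0.5) > tol
    clear_second = abs(p_a_second - 0.5) > tol

    # Assumed rule: the same option clearly wins in both orders.
    if a_wins_first == a_wins_second and clear_first and clear_second:
        return "robust"

    # Otherwise the majority flips with order or is near a coin flip.
    # Assumed rule: if support averaged over both orders is balanced,
    # position alone decides, which we label indifference.
    mean_a = (p_a_first + p_a_second) / 2
    if abs(mean_a - 0.5) <= tol:
        return "indifferent"

    # Assumed rule: order flips the choice, but one option still has
    # more support overall, so the preference exists and is fragile.
    return "fragile"


print(classify_pair(0.95, 0.90))  # robust
print(classify_pair(0.90, 0.10))  # indifferent
print(classify_pair(0.95, 0.35))  # fragile
```

Under these assumed rules, the third case is the one that matters for the paper's claim: A is favored on average, yet B wins whenever B is listed first.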

If this is right

  • When all options are high quality, models favor the first presented option.
  • When quality is lower, models favor later options instead.
  • A name bias favors certain names even after controlling for demographic signals.
  • Order effects can produce selection of strictly inferior options rather than mere tie-breaking.
  • Adjusting the temperature parameter can mitigate order distortions and recover underlying preferences.
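
For readers unfamiliar with the temperature parameter, the sketch below shows the standard mechanism only: logits are divided by the temperature before the softmax, so a lower temperature concentrates probability on the top-scoring option. It does not reproduce the paper's mitigation procedure. The logits are hypothetical, and treating T = 0 as a deterministic argmax is a common convention assumed here.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to choice probabilities at a given sampling temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumed convention: T = 0 means greedy (argmax) selection.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for "pick the first option" and "pick the second option".
logits = [2.0, 1.0]
for t in (0.0, 0.5, 1.0):
    print(t, softmax_with_temperature(logits, t))
```

At T = 0 a small positional advantage becomes a certain outcome, while at higher temperatures the same advantage shows up as a graded choice frequency, which is one reason repeated sampling at T > 0 can expose how strong a preference is.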

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • High-stakes systems such as hiring or admissions tools may systematically disadvantage candidates based on presentation order.
  • The classification of fragile preferences could audit other AI decision systems for similar order vulnerabilities.
  • Prompting or sampling methods beyond temperature might further stabilize preferences against position shifts.
  • Order-independent evaluation protocols would be needed to make LLM decisions reliable in sequential review settings.

Load-bearing premise

The color selection task successfully isolates pure position effects by removing all confounding factors such as semantic content or real-world associations.

What would settle it

Run the color selection task with objectively ranked colors presented in varied orders and measure whether models still choose lower-ranked options at rates predicted by the quality-dependent shift.
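
The measurement described above can be sketched as a small harness that presents every ordering of a set of options and tallies selections by position and by option. The `choose` callable stands in for a model call and is an assumption, as are the function names; the stub `always_first` is a deliberately biased placeholder, not a model of any real LLM.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Sequence, Tuple


def position_tally(
    options: Sequence[str],
    choose: Callable[[Sequence[str]], int],
) -> Tuple[Counter, Counter]:
    """Present every ordering to `choose`; count picks by position and by option."""
    by_position: Counter = Counter()
    by_option: Counter = Counter()
    for order in permutations(options):
        idx = choose(order)  # index of the selected item within this ordering
        by_position[idx] += 1
        by_option[order[idx]] += 1
    return by_position, by_option


def always_first(order: Sequence[str]) -> int:
    """Placeholder chooser with a pure primacy bias (hypothetical)."""
    return 0


colors = ["Robin's Egg Blue", "Aqua", "Lilac"]
by_position, by_option = position_tally(colors, always_first)
print(dict(by_position))  # {0: 6}
print(dict(by_option))    # each color is chosen twice
```

A chooser driven purely by position spreads its picks evenly across options, as here. A chooser with an order-independent preference would instead concentrate `by_option` on one color and spread `by_position` across positions.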

Figures

Figures reproduced from arXiv: 2506.14092 by Haonan Yin, Shai Vardi, Vidyanand Choudhary.

Figure 1: Position bias in pairwise comparisons at T = 1. (a) shows the effect of presentation order on color selection in pairwise comparisons for each quality tier. (b) shows aggregate positional effects in resume evaluations across professions. In both tasks, higher-quality options tend to exhibit a primacy effect and lower-quality ones a recency effect, though the precise threshold separating them varies by mode…
Figure 2: Position bias in triplewise comparisons at T = 0. (a) shows the effect of presentation order on color selection. (b) shows aggregate positional effects in resume evaluations across professions. Additional results for color selection are provided in Appendix F, and resume selection results disaggregated by profession are in Appendix G.
Figure 3: Interaction of order effects with other biases. (a) shows the distribution of name selections in triplewise comparisons. Claude 3 Haiku exhibits strong preferences for certain names, while GPT-4o-mini shows a relatively balanced selection pattern. (b) shows the effect of gender and presentation order in pairwise resume comparisons. Order effects appear stronger than gender effects.
Figure 4 shows the results at different temperatures (T = 0, 0.5, and 1). At T = 0 (the topmost plot), all models strongly prefer the first option for high-quality color sets, and shift to the second position for lower-quality sets. For instance, in color selection (Figure 4a), GPT-4o-mini displayed a primacy effect in the top two tiers (Ideal and Fair) and a recency effect in the bottom two (Plain and Harsh). Clau…
Figure 5: Order effects in pairwise comparisons at three temperature settings. The figure shows the percentage of selections by position (first or second), broken down by color tier for the Claude model across temperatures T = 0, 0.5, and 1 (left to right).

  Tier    Color 1            Color 2   First/Second (%)   Cohen's h   p-value
  Ideal   Robin's Egg Blue   Aqua      91 / 9             0.96        3.3 × 10^-18
  Fair    Gentle Coral       Lilac     16 / 84            0.75        2.6 × 10^-12
  Plai…
Figure 6: Order effects in pairwise resume comparisons, T = 0. Each panel shows the proportion of times each position was chosen for each profession (Mechanical Engineer, Real Estate Agent, Journalist, and Registered Nurse, top to bottom) and aggregated, broken down by LLM and quality tier.
Figure 7: Order effects in pairwise resume comparisons, T = 1. Each panel shows the proportion of times each position was chosen for each profession (Mechanical Engineer, Real Estate Agent, Journalist, and Registered Nurse, top to bottom) and aggregated, broken down by LLM and quality tier.
Figure 8: Order effects in triplewise comparisons at three temperature settings. Each panel (T = 0, 0.5, and 1, top to bottom) shows the percentage of selections by position, broken down by model and color tier.
Figure 9: Position bias in triplewise resume comparisons, T = 0. Positional order effects in triplewise comparisons across four professions (Mechanical Engineer, Real Estate Agent, Journalist, and Registered Nurse, top to bottom) and aggregated across professions.
Figure 10: Triplewise comparisons, T = 1. Positional order effects in triplewise comparisons across four professions (Mechanical Engineer, Real Estate Agent, Journalist, and Registered Nurse, top to bottom) and aggregated totals.
Figure 11 shows the proportion of times each position (first, second, third, fourth) was chosen across all permutations, indicating consistent positional preferences in fourwise comparisons across all three models.
Figure 12 visualizes the full distribution of name selection frequencies for each model using histograms with bin size 3. The expected number of times each name would be chosen if there were no name bias is 32. GPT-4o-mini chose each name between 26 and 38 times, consistent with little to no bias. Claude 3 Haiku's selections show high variance, with a wide spread and strong preference for certain names.
Figure 13: Box plot of the times each name was selected per model.

  Name                 Gender   Number of Times Chosen
  Christopher Taylor   male     62
  Megan Anderson       female   61
  Stephanie Clark      female   56
  Joshua Clark         male     52
  Jessica Johnson      female   49
  ...                  ...      ...
  Lauren Wilson        female   17
  Hannah Miller        female   15
  Michael Brown        male     13
  Andrew Harris        male     9
  Nicole Harris        female   8
Figure 14: Same-gender vs cross-gender comparisons. Four results are displayed for comparison: 1) order bias within same-gender comparisons, 2) aggregate order bias in cross-gender comparisons, 3) order bias when the female candidate was in the first position, and 4) order bias when the male candidate was in the first position. In all cases, the plots report the proportion of times the first candidate presented was chose…
Figure 15: Gender and position bias in pairwise comparisons, T = 0. Gender effects in pairwise comparisons across four professions (Mechanical Engineer, Real Estate Agent, Journalist, and Registered Nurse, top to bottom) and aggregated. Each panel shows the proportion of times each position was chosen, broken down by LLM and quality tier.
Figure 16: Gender and position bias in pairwise comparisons, T = 1. Gender effects in pairwise comparisons across four professions (Mechanical Engineer, Real Estate Agent, Journalist, and Registered Nurse, top to bottom) and aggregated. Each panel shows the proportion of times each position was chosen, broken down by LLM and quality tier.
Original abstract

Large language models (LLMs) are increasingly deployed in decision-support systems for high-stakes domains such as hiring and university admissions, where choices often involve selecting among competing alternatives. While prior work has noted position biases in LLM-driven comparisons, these biases have not been systematically analyzed or linked to underlying preference structures. We present the first comprehensive study of position biases across multiple LLMs and two distinct domains: resume comparisons, representing a realistic high-stakes context, and color selection, which isolates position effects by removing confounding factors. We find strong and consistent order effects, including a quality-dependent shift: when all options are high quality, models favor the first option, but when quality is lower, they favor later options. We also identify a previously undocumented bias: a name bias, where certain names are favored despite controlling for demographic signals. To separate superficial tie-breaking from genuine distortions of judgment, we extend the rational choice framework to classify pairwise preferences as robust, fragile, or indifferent. Using this framework, we show that order effects can lead models to select strictly inferior options. These results indicate that LLMs exhibit distinct failure modes not documented in human decision-making. We also propose targeted mitigation strategies, including a novel use of the temperature parameter, to recover underlying preferences when order effects distort model behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a comprehensive empirical study of position biases in LLMs across resume comparison (high-stakes) and color selection (controlled) tasks. It reports consistent order effects, a quality-dependent shift (first-position preference for high-quality options, later-position for lower quality), and a name bias. The authors extend the rational choice framework to classify pairwise preferences as robust, fragile, or indifferent, and use this to argue that order effects cause selection of strictly inferior options. Targeted mitigations, including a novel temperature-based approach, are proposed to recover underlying preferences.

Significance. If substantiated with proper statistical controls and independent quality anchors, the work would be significant for documenting LLM-specific failure modes in decision tasks that differ from documented human biases. The robust/fragile/indifferent classification framework offers a reusable tool for dissecting preference distortions. The quality-dependent shift and name bias findings have direct relevance for LLM deployment in hiring and admissions, while the controlled color task and mitigation proposals could inform practical safeguards.

major comments (3)
  1. [Preference Classification Framework] The extension of the rational choice framework (detailed in the section introducing the robust/fragile/indifferent classification) defines 'strictly inferior' selections from consistency patterns across order permutations. Without an independent, order-independent quality metric (e.g., pre-defined resume scores or objective color attributes) to anchor baseline superiority, the central claim that order effects produce strictly inferior choices risks circularity, as the prompt-supplied quality manipulation itself may be re-interpreted order-dependently.
  2. [Color Selection Task Description] The color selection task is presented as isolating pure position effects by removing semantic confounders, which underpins attribution of the quality-dependent shift solely to order. However, if the high/low quality labels supplied in the prompt can be processed by the model in an order-sensitive way, this isolation fails and the shift cannot be cleanly separated from re-ranking rather than a true preference violation.
  3. [Experimental Methods] No sample sizes, number of trials, statistical tests, confidence intervals, or exact prompting templates are reported in the methods or results. This absence directly affects evaluation of the strength and reliability of the order effects and the claim of strictly inferior selections.
minor comments (2)
  1. [Introduction] The abstract and introduction would benefit from explicit comparison to prior position-bias studies in LLMs to better situate the novelty of the quality-dependent shift.
  2. [Results Figures] Figures illustrating the quality-dependent shift and name bias should include error bars or significance markers and ensure axis labels are fully descriptive.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful and constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions have been made to improve clarity, rigor, and reporting.

Point-by-point responses
  1. Referee: [Preference Classification Framework] The extension of the rational choice framework (detailed in the section introducing the robust/fragile/indifferent classification) defines 'strictly inferior' selections from consistency patterns across order permutations. Without an independent, order-independent quality metric (e.g., pre-defined resume scores or objective color attributes) to anchor baseline superiority, the central claim that order effects produce strictly inferior choices risks circularity, as the prompt-supplied quality manipulation itself may be re-interpreted order-dependently.

    Authors: We appreciate the referee's concern about potential circularity. The quality levels in our experiments were defined through explicit, objective manipulations in the prompts (standardized rubrics for resume attributes such as years of experience and education level; specific measurable attributes such as hue and saturation for colors). To directly address the risk of order-dependent reinterpretation, we have added a new subsection describing an independent quality anchoring procedure: a separate set of order-agnostic evaluations (using fixed prompts without positional variation and, where feasible, reference to external objective metrics) to establish baseline superiority before the main pairwise trials. This anchors the 'strictly inferior' classification more firmly outside the order-permutation consistency patterns.
    revision: partial

  2. Referee: [Color Selection Task Description] The color selection task is presented as isolating pure position effects by removing semantic confounders, which underpins attribution of the quality-dependent shift solely to order. However, if the high/low quality labels supplied in the prompt can be processed by the model in an order-sensitive way, this isolation fails and the shift cannot be cleanly separated from re-ranking rather than a true preference violation.

    Authors: We agree that quality labels embedded in prompts could in principle be processed in an order-sensitive manner. However, the color task was constructed with deliberately minimal semantic content precisely to isolate positional effects from domain-specific reasoning. The quality-dependent shift we report is replicated across both the resume and color domains, which would be unlikely if the effect were driven solely by label re-ranking. We have revised the task description to clarify the neutral presentation of labels and added a control condition (quality labels omitted) whose results are now reported; the positional bias persists, supporting that the shift reflects a genuine preference distortion rather than label re-ordering alone.
    revision: partial

  3. Referee: [Experimental Methods] No sample sizes, number of trials, statistical tests, confidence intervals, or exact prompting templates are reported in the methods or results. This absence directly affects evaluation of the strength and reliability of the order effects and the claim of strictly inferior selections.

    Authors: We thank the referee for highlighting this reporting gap. The Experimental Methods and Results sections have been expanded to include complete details: 500 pairwise comparisons per model per task (with 10 independent runs for robustness), exact trial counts, statistical tests (binomial tests for order preference, ANOVA for quality-by-position interactions, with all p-values, effect sizes, and 95% confidence intervals now reported), and the full prompting templates placed in a new appendix. These additions allow direct assessment of statistical reliability and reproducibility.
    revision: yes
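
As a worked illustration of the kind of statistics named above, the sketch below computes Cohen's h against a 50/50 baseline and an exact two-sided binomial p-value for a first/second split. The 91 / 9 split is taken from the table fragment under Figure 5. The trial count of 100, the choice of 0.5 as the comparison proportion, and the function names are assumptions made here, not details confirmed by this page.

```python
import math


def cohens_h(p1: float, p2: float = 0.5) -> float:
    """Cohen's h effect size between two proportions (arcsine transform)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided p-value for k successes in n trials under p = 0.5.

    The null distribution is symmetric at p = 0.5, so doubling the
    smaller tail gives the exact two-sided value (capped at 1).
    """
    tail = min(k, n - k)
    lower = sum(math.comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * lower)


# Assumed: 100 trials with the first option chosen 91 times.
print(round(cohens_h(0.91), 2))                # 0.96
print(f"{binomial_two_sided_p(91, 100):.1e}")  # 3.3e-18
```

Under these assumptions the outputs agree with the Cohen's h and p-value shown for the Ideal tier in the Figure 5 table, which suggests, without confirming, that the reported effect sizes are measured against an even split.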

Circularity Check

0 steps flagged

No significant circularity; empirical observations and framework extension are self-contained

Full rationale

The paper reports direct experimental results from LLM queries on resume comparisons and color selection tasks, with quality levels and position orders explicitly manipulated in prompts. The extension of the rational choice framework classifies pairwise preferences as robust/fragile/indifferent based on consistency across those controlled permutations, then applies the labels to identify order-induced selections. This classification operates on the observed model outputs rather than deriving new quantities from fitted parameters or prior self-citations; the color task is presented as isolating position effects via removal of semantic content, providing an independent anchor for attributing shifts to order. No equations reduce by construction to inputs, no load-bearing self-citations justify core premises, and results remain falsifiable against the raw model responses under the stated prompt conditions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

This is an empirical behavioral study. No mathematical free parameters are fitted to derive the central claims. The classification into robust/fragile/indifferent is an interpretive extension rather than a new postulated physical entity.

invented entities (1)
  • robust/fragile/indifferent preference classification · no independent evidence
    purpose: To distinguish superficial tie-breaking from genuine order-induced distortions of judgment
    The framework is introduced to label pairwise outcomes and thereby show that order can flip selections to inferior options.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Active Learners as Efficient PRP Rerankers

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Active learning rankers outperform sorting-based aggregation of noisy LLM pairwise judgments and a randomized single-call oracle removes position bias without extra queries.

  2. Active Learners as Efficient PRP Rerankers

    cs.LG · 2026-05 · unverdicted · novelty 5.0

    Active learning applied to noisy LLM pairwise judgments improves NDCG@10 per call in budget-constrained reranking and enables unbiased aggregation via a randomized-direction single-call oracle.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 1 Pith paper

  1. [1]

    Systematic review: The use of large language models as medical chatbots in digestive diseases

    Mauro Giuffr `e, Simone Kresevic, Kisung You, Johannes Dupont, Jack Huebner, Alyssa Ann Grimshaw, and Dennis Legen Shung. Systematic review: The use of large language models as medical chatbots in digestive diseases. Alimentary pharmacology & therapeutics, 60(2):144–166, 2024

  2. [2]

    T. R. Cook and S. Kazinnik. Social group bias in ai finance. arXiv [Preprint] , 2025. . https://arxiv.org/abs/2506.17490 (accessed 4 July 2025)

  3. [3]

    Regulation 2024/1689 of the eur

    Nathalie A Smuha. Regulation 2024/1689 of the eur. parl. & council of june 13, 2024 (eu artificial intelligence act). International Legal Materials, pages 1–148, 2025

  4. [4]

    Auditing work: Exploring the new york city algorithmic bias audit regime

    Lara Groves, Jacob Metcalf, Alayna Kennedy, Briana Vecchione, and Andrew Strait. Auditing work: Exploring the new york city algorithmic bias audit regime. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1107–1120, 2024

  5. [5]

    Dennis, David Graus, Philipp Hacker, Jorge Saldivar, Frederik Zuiderveen Borgesius, and Asia J

    Alessandro Fabris, Nina Baranowska, Matthew J. Dennis, David Graus, Philipp Hacker, Jorge Saldivar, Frederik Zuiderveen Borgesius, and Asia J. Biega. Fairness and bias in algorithmic hiring: A multidisciplinary survey. ACM Transactions on Intelligent Systems and Technology, 16(1):1–54, January 2025

  6. [6]

    Application of LLM Agents in Recruitment: A Novel Framework for Automated Resume Screening

    Chengguang Gan, Qinghao Zhang, and Tatsunori Mori. Application of LLM Agents in Recruitment: A Novel Framework for Automated Resume Screening. Journal of Information Processing, 32:881–893, 2024

  7. [7]

    Gender, race, and intersectional bias in resume screening via language model retrieval

    Kyra Wilson and Aylin Caliskan. Gender, race, and intersectional bias in resume screening via language model retrieval. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 1578–1590, 2024. 10 Fragile Preferences: Order Effects in LLMs

  8. [8]

    Gaebler, Sharad Goel, Aziz Huq, and Prasanna Tambe

    Johann D. Gaebler, Sharad Goel, Aziz Huq, and Prasanna Tambe. Auditing the use of language models to guide hiring decisions. arXiv [Preprint], 2024. . https://arxiv.org/abs/2404.03086 (accessed 4 July 2025)

  9. [9]

    Jobfair: A framework for benchmarking gender hiring bias in large language models

    Ze Wang, Zekun Wu, Xin Guan, Michael Thaler, Adriano Koshiyama, Skylar Lu, Sachin Beepath, Ediz Ertekin Jr, and Maria Perez-Ortiz. Jobfair: A framework for benchmarking gender hiring bias in large language models. arXiv [Preprint], 2024. . https://arxiv.org/abs/2406.15484 (accessed 4 July 2025)

  10. [10]

    Prometheus: Inducing fine-grained evaluation capability in language mod- els

    Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. Prometheus: Inducing fine-grained evaluation capability in language mod- els. In The Twelfth International Conference on Learning Representations, 2023

  11. [11]

    Learning evaluation models from large language models for sequence generation

    Chenglong Wang, Hang Zhou, Kaiyan Chang, Tongran Liu, Chunliang Zhang, Quan Du, Tong Xiao, and Jingbo Zhu. Learning evaluation models from large language models for sequence generation. arXiv [Preprint], 2023. . https://arxiv.org/abs/2308.04386 (accessed 4 July 2025)

  12. [12]

    Self-judge: Selective instruction following with alignment self-evaluation

    Hai Ye and Hwee Tou Ng. Self-judge: Selective instruction following with alignment self-evaluation. arXiv [Preprint], 2024. . https://arxiv.org/abs/2409.00935 (accessed 4 July 2025)

  13. [13]

    FAIRE: Assessing Racial and Gender Bias in AI-Driven Resume Evaluations, 2025

    Athena Wen, Tanush Patil, Ansh Saxena, Yicheng Fu, Sean O’Brien, and Kevin Zhu. FAIRE: Assessing Racial and Gender Bias in AI-Driven Resume Evaluations, 2025

  14. [14]

    Aligning with human judgement: The role of pairwise preference in large language model evaluators

    Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vuli´c, Anna Korhonen, and Nigel Collier. Aligning with human judgement: The role of pairwise preference in large language model evaluators. arXiv [Preprint],

  15. [15]

    https://arxiv.org/abs/2403.16950 (accessed 4 July 2025)

  16. [16]

    A setwise approach for effective and highly efficient zero-shot ranking with large language models

    Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, and Guido Zuccon. A setwise approach for effective and highly efficient zero-shot ranking with large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 38–47, 2024

  17. [17]

    Large language models are effective text rankers with pairwise ranking prompting.arXiv preprint arXiv:2306.17563,

    Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, et al. Large language models are effective text rankers with pairwise ranking prompting. arXiv [Preprint], 2023. . https://arxiv.org/abs/2306.17563 (accessed 4 July 2025)

  18. [18]

    Judging LLM-as-a-judge with MT-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023

  19. [19]

    Decision making in the employment interview

    Edward C Webster and Clifford Wilfred Anderson. Decision making in the employment interview. (No Title), 1964

  20. [20]

    Factors affecting the final decision in the employment interview

    BM Springbett. Factors affecting the final decision in the employment interview. Canadian Journal of Psychol- ogy/Revue canadienne de psychologie, 12(1):13, 1958

  21. [21]

    The primacy order effect in complex decision making

    Arnaud Rey, K ´evin Le Goff, Marl`ene Abadie, and Pierre Courrieu. The primacy order effect in complex decision making. Psychological Research, 84(6):1739–1748, 2020

  22. [22]

    Primacy and recency effects on clicking behavior

    Jamie Murphy, Charles Hofacker, and Richard Mizerski. Primacy and recency effects on clicking behavior. Journal of computer-mediated communication, 11(2):522–535, 2006

  23. [23]

    Effects of applicant stereotypes, order, and information on interview impressions

    Manuel London and Milton D Hakel. Effects of applicant stereotypes, order, and information on interview impressions. Journal of Applied Psychology, 1974

  24. [24]

    Effect of interview information in altering valid impressions

    Robert E Carlson. Effect of interview information in altering valid impressions. Journal of Applied Psychology, 55(1):66, 1971

  25. [25]

    Judgment under uncertainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty

    Amos Tversky and Daniel Kahneman. Judgment under uncertainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty. science, 185(4157):1124–1131, 1974

  26. [26]

    A literature review of the anchoring effect

    Adrian Furnham and Hua Chu Boo. A literature review of the anchoring effect. The Journal of Socio-Economics, 40(1):35–42, 2011

  27. [27]

    Training interviewers to eliminate contrast effects in employment interviews

    Kenneth N Wexley, Raymond E Sanders, and Gary A Yukel. Training interviewers to eliminate contrast effects in employment interviews. Journal of Applied Psychology, 57(3):233, 1973

  28. [28]

    Training managers to minimize rating errors in the observation of behavior

    Gary P Latham, Kenneth N Wexley, and Elliot D Pursell. Training managers to minimize rating errors in the observation of behavior. Journal of Applied Psychology, 60(5):550, 1975

  29. [29]

    Context-dependent selection: The effects of decoy and phantom job candidates.Organizational Behavior and Human Decision Processes, 65(1):68–76, 1996

    Scott Highhouse. Context-dependent selection: The effects of decoy and phantom job candidates.Organizational Behavior and Human Decision Processes, 65(1):68–76, 1996

  30. [30]

    Examining models of nondominated decoy effects across judgment and choice

    Jonathan C Pettibone and Douglas H Wedell. Examining models of nondominated decoy effects across judgment and choice. Organizational Behavior and Human Decision Processes, 81(2):300–328, 2000. 11 Fragile Preferences: Order Effects in LLMs

  31. [31]

    The decoy effect and recommendation systems

    Nasim Mousavi, Panagiotis Adamopoulos, and Jesse Bockstedt. The decoy effect and recommendation systems. Information Systems Research, 34(4):1533–1553, 2023

  32. [32]

    arXiv:2308.11483 [cs]

    Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple- choice questions. arXiv [Preprint], 2023. . https://arxiv.org/abs/2308.11483 (accessed 4 July 2025)

  33. [33]

    Measuring the inconsistency of large language models in preferential ranking

    Xiutian Zhao, Ke Wang, and Wei Peng. Measuring the inconsistency of large language models in preferential ranking. arXiv [Preprint], 2024. . https://arxiv.org/abs/2410.08851 (accessed 4 July 2025)

  34. [34]

    Serial position effects of large language models, 2024

    Xiaobo Guo and Soroush V osoughi. Serial position effects of large language models, 2024

  35. [35]

    Primacy effect of chatgpt

    Yiwei Wang, Yujun Cai, Muhao Chen, Yuxuan Liang, and Bryan Hooi. Primacy effect of chatgpt. arXiv [Preprint], 2023. . https://arxiv.org/abs/2310.13206 (accessed 4 July 2025)

  36. [36]

    Gender and positional biases in llm-based hiring decisions: Evidence from comparative cv/re- sume evaluations

    David Rozado. Gender and positional biases in llm-based hiring decisions: Evidence from comparative cv/re- sume evaluations. arXiv [Preprint], 2025. . https://arxiv.org/abs/2505.17049 (accessed 4 July 2025)

  37. [37]

    Anchoring bias in large language models: An experimental study

    Jiaxu Lou and Yifan Sun. Anchoring bias in large language models: An experimental study. arXiv [Preprint], https://arxiv.org/abs/2412.06593 (accessed 4 July 2025)

  39. [39]

    Irrelevant alternatives bias large language model hiring decisions

    Kremena Valkanova and Pencho Yordanov. Irrelevant alternatives bias large language model hiring decisions. arXiv [Preprint], 2024. https://arxiv.org/abs/2409.15299 (accessed 4 July 2025)

  40. [40]

    On the emergence of position bias in transformers

    Xinyi Wu, Yifei Wang, Stefanie Jegelka, and Ali Jadbabaie. On the emergence of position bias in transformers. arXiv [Preprint], 2025. https://arxiv.org/abs/2502.01951 (accessed 4 July 2025)

  41. [41]

    A systematic evaluation of large language models of code

    Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN international symposium on machine programming, pages 1–10, 2022

  42. [42]

    The effect of sampling temperature on problem solving in large language models

    Matthew Renze. The effect of sampling temperature on problem solving in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7346–7356, 2024

  43. [43]

    The structure of random utility models

    Charles F Manski. The structure of random utility models. Theory and decision, 8(3):229, 1977

  44. [44]

    Reference-point formation and updating

    Manel Baucells, Martin Weber, and Frank Welfens. Reference-point formation and updating. Management Science, 57(3):506–519, 2011

  45. [45]

    Theory of games and economic behavior: 60th anniversary commemorative edition

    John Von Neumann and Oskar Morgenstern. Theory of games and economic behavior: 60th anniversary commemorative edition. In Theory of games and economic behavior. Princeton university press, 2007

  46. [46]

    Explicitly unbiased large language models still form biased associations

    Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, and Thomas L Griffiths. Explicitly unbiased large language models still form biased associations. Proceedings of the National Academy of Sciences, 122(8):e2416228122, 2025

  47. [47]

    Marked personas: Using natural language prompts to measure stereotypes in language models

    Myra Cheng, Esin Durmus, and Dan Jurafsky. Marked personas: Using natural language prompts to measure stereotypes in language models. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023

  48. [48]

    The silicon ceiling: Auditing gpt’s race and gender biases in hiring

    Lena Armstrong, Abbey Liu, Stephen MacNeil, and Danaë Metaxa. The silicon ceiling: Auditing gpt’s race and gender biases in hiring. In Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, EAAMO ’24, page 1–18. ACM, October 2024

  49. [49]

    Gender bias and stereotypes in large language models

    Hadas Kotek, Rikker Dockum, and David Sun. Gender bias and stereotypes in large language models. In Proceedings of the ACM collective intelligence conference, pages 12–24, 2023

  50. [50]

    Choosing to avoid: Coping with negatively emotion-laden consumer decisions

    Mary Frances Luce. Choosing to avoid: Coping with negatively emotion-laden consumer decisions. Journal of consumer research, 24(4):409–433, 1998

  51. [51]

    The effect of expected value on attraction effect preference reversals

    George D Farmer, Paul A Warren, Wael El-Deredy, and Andrew Howes. The effect of expected value on attraction effect preference reversals. Journal of Behavioral Decision Making, 30(4):785–793, 2017

  52. [52]

    Reversals of preference between bids and choices in gambling decisions

    Sarah Lichtenstein and Paul Slovic. Reversals of preference between bids and choices in gambling decisions. Journal of experimental psychology, 89(1):46, 1971

  53. [53]

    Choosing to choose: The effects of decoys and prior choice on deferral

    William M. Hedgcock, Raghunath Singh Rao, and Haipeng (Allan) Chen. Choosing to choose: The effects of decoys and prior choice on deferral. Management Science, 62(10):2952–2976, 2016

  54. [54]

    Large language models show human-like content biases in transmission chain experiments

    Alberto Acerbi and Joseph M Stubbersfield. Large language models show human-like content biases in transmission chain experiments. Proceedings of the National Academy of Sciences, 120(44):e2313790120, 2023

  55. [55]

    Cognitive biases in natural language: Automatically detecting, differentiating, and measuring bias in text

    Kyrtin Atreides and David J Kelley. Cognitive biases in natural language: Automatically detecting, differentiating, and measuring bias in text. Cognitive Systems Research, 88:101304, 2024.

  56. [56]

    Large language models display human-like social desirability biases in big five personality surveys

    Aadesh Salecha, Molly E Ireland, Shashanka Subrahmanya, João Sedoc, Lyle H Ungar, and Johannes C Eichstaedt. Large language models display human-like social desirability biases in big five personality surveys. PNAS Nexus, 3(12):pgae533, 2024

  57. [57]

    Cognitive bias in high-stakes decision-making with llms

    Jessica Maria Echterhoff, Yao Liu, Abeer Alessa, Julian J McAuley, and Zexue He. Cognitive bias in high-stakes decision-making with llms. CoRR, 2024

  58. [58]

    A comprehensive evaluation of cognitive biases in llms

    Simon Malberg, Roman Poletukhin, Carolin M. Schuster, and Georg Groh. A comprehensive evaluation of cognitive biases in llms. arXiv [Preprint], 2024. https://arxiv.org/abs/2410.15413 (accessed 4 July 2025)

  59. [59]

    CBEval: A framework for evaluating and interpreting cognitive biases in LLMs

    Ammar Shaikh, Raj Abhijit Dandekar, Sreedath Panat, and Rajat Dandekar. Cbeval: A framework for evaluating and interpreting cognitive biases in llms. arXiv [Preprint], 2024. https://arxiv.org/abs/2412.03605 (accessed 4 July 2025)

  60. [60]

    Serial effects in recall of unorganized and sequentially organized verbal material

    James Deese and Roger A Kaufman. Serial effects in recall of unorganized and sequentially organized verbal material. Journal of experimental psychology, 54(3):180, 1957

  61. [61]

    Memory: A contribution to experimental psychology

    Hermann Ebbinghaus. Memory: A contribution to experimental psychology. Annals of neurosciences, 20(4):155, 2013

  62. [62]

    Proactive inhibition in short-term retention of single items

    Geoffrey Keppel and Benton J Underwood. Proactive inhibition in short-term retention of single items. Journal of Verbal Learning and Verbal Behavior, 1(3):153–161, 1962

  63. [63]

    A reverse outcome bias: The influence of multiple reference points on the evaluation of outcomes and decisions

    Terry L. Boles and David M. Messick. A reverse outcome bias: The influence of multiple reference points on the evaluation of outcomes and decisions. Organizational Behavior and Human Decision Processes, 61(3):262–275, 1995

  64. [64]

    Popular baby names by decade

    U.S. Social Security Administration. Popular baby names by decade. https://www.ssa.gov/oact/babynames/decades/index.html, 2024. Accessed: 2025-06-10.

    A Color Comparison Sets and Prompts

    Each model was asked to perform triplewise and pairwise comparisons on four sets of three colors, categorized by tier. From ...

    Which color is the best for a kid's room: {first} or {second}? Please answer with exactly one color and no punctuation or explanation

  65. [65]

    Female when first

    If there were no name bias, each name would be expected to be chosen 32 times. GPT-4o-mini chose each name between 26 and 38 times, consistent with little to no bias. Claude 3 Haiku's selections show high variance, with a wide spread and strong preference for certain names. Figure 13 shows box plots for each model, confirming these patterns: Cla...