pith. sign in

arxiv: 2606.20527 · v1 · pith:AFUTLOUDnew · submitted 2026-06-18 · 💻 cs.CL · cs.CV

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Pith reviewed 2026-06-26 17:18 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords social biasmultimodal large language modelsvisual attributesbenchmarkattribute-level effectsMLLMsphotorealistic images
0
0 comments X

The pith

Fifteen visual attributes account for nearly 80 percent of social bias variation in multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates StylisticBias, a benchmark that starts with 500 photorealistic base faces and produces roughly 50 single-attribute variants per face for a total of 25,000 images. Identity stays fixed while one visual cue changes at a time, allowing direct measurement of how specific appearance signals shift model outputs on 25 binary social judgments. Testing six MLLMs shows age and body type produce the largest identity-level effects, fashion style produces the largest attribute-level shifts, and roughly 15 attributes together explain 80 percent of total variation. Effects are strongest when the judgment task is semantically matched to the changed attribute, such as socioeconomic judgments responding to clothing cues.

Core claim

StylisticBias isolates the contribution of individual visual attributes by generating photorealistic images that vary only one attribute while holding identity constant. Across six MLLMs and 25 judgment scenarios, the resulting measurements show that bias is not distributed uniformly: age and body type dominate identity-level effects, fashion style drives the largest attribute-level shifts, and approximately 15 attributes account for nearly 80 percent of the observed variation. Sensitivity is highest for judgments that are semantically aligned with the altered appearance cue.

What carries the argument

StylisticBias benchmark that produces single-attribute photorealistic variations on fixed-identity faces to isolate the effect of each visual cue on model social judgments.

If this is right

  • A small number of targeted interventions on the dominant attributes could address most of the measured bias.
  • Models exhibit stronger bias shifts when the social judgment is semantically related to the visual attribute changed.
  • Age and body type produce larger identity-level effects than other attributes, while fashion style produces larger attribute-level effects.
  • The concentration result indicates that bias arises primarily from a limited set of appearance-based associations rather than from diffuse or identity-specific sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the concentration finding generalizes, bias mitigation could be achieved by auditing or constraining model behavior on responses triggered by those 15 cues instead of broad retraining.
  • The same fixed-identity variation design could be applied to test whether text-only models or other modalities exhibit similar concentration of bias drivers.
  • The results suggest models are picking up and amplifying human visual stereotypes that link specific appearance signals to social categories, which could be tested by measuring consistency of judgments across different demographic base faces.

Load-bearing premise

The photorealistic image generation produces faithful single-attribute variations that do not introduce unintended artifacts or correlations confounding the measured bias effects.

What would settle it

Re-running the judgments on a new set of images generated by an independent method that alters multiple attributes simultaneously or introduces different correlations would show whether the concentration of 80 percent of variation in 15 attributes still holds.

Figures

Figures reproduced from arXiv: 2606.20527 by Jana Diesner, Nafiseh Nikeghbal, Samantha Dalal, Shaghayegh Kolli, Timo Cavelius.

Figure 1
Figure 1. Figure 1: Benchmark construction and evaluation. (1) Benchmark Generation: A Cartesian product of four demographic attributes yields 90 configurations from which 500 synthetic base faces Xb are generated. Each base face receives ∼50 single-attribute variations, yielding ∼25,000 images Xv = {v(x) | x ∈ Xb}. (2) Benchmark Evaluation: Six MLLMs perform binary forced-choice judgments across 25 scenarios under 3 seeds an… view at source ↗
Figure 2
Figure 2. Figure 2: Cumulative |SBS| by attribute, sorted by mag￾nitude. 15 attributes reach the 80% threshold. is 5.5× stronger than Slicked-back (0.0098, one￾sided WSRT, BH-corrected over n=5 pairs, p < 2.9×10−47). This asymmetry mirrors negativity bias in human social cognition (Zebrowitz and Montepare, 2008) and has a direct implication for bias auditing: evaluations that focus only on posi￾tive appearance cues will syste… view at source ↗
Figure 4
Figure 4. Figure 4: Mean SBS (x-axis) vs. mean |SBS| (y-axis) for each of the 25 scenarios, with bootstrap 95% CI (face-level). conomic and appearance-related scenarios occupy a distinct high-magnitude region while all other categories cluster near the origin. We call this se￾mantic alignment bias: models rely most heavily on appearance cues when the queried judgment is culturally associated with visible appearance. Across mo… view at source ↗
Figure 5
Figure 5. Figure 5: Gemma-3 vs. Gemma-4 mean ∆ per scenario, colored by judgment category (r = 0.75, slope = 0.39). ble. Fashion |SBS| spans 0.088 (Gemma-4) to 0.176 (Gemma-3), with category ranking preserved across all six architectures. The Gemma family provides the clearest within￾architecture comparison ( [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt template used to synthesize a demo [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Prompt template used to generate controlled [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prompt template used to generate controlled [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Forced-choice prompt template used for the bias evaluation. The placeholders ⟨option_a⟩ and ⟨option_b⟩ are filled with a scenario pair (e.g., closed￾minded vs. open-minded); the framing is designed to discourage refusal or hedging. judgment stage; the dataset is not affected. Computational scale of the unreduced judgment. The complete variation grid grows combinatori￾ally with the number of attribute value… view at source ↗
Figure 10
Figure 10. Figure 10: Example base faces and representative demographic and stylistic variations used in the benchmark. [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
read the original abstract

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80\% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces StylisticBias, a controlled benchmark using 500 photorealistic base faces and ~50 single-attribute variations per face (~25K images total) to isolate the effects of specific visual cues on social judgments. Six MLLMs are evaluated on 25 binary judgment scenarios. The central claims are that age and body type dominate identity-level effects, fashion style drives the largest attribute-level shifts, ~15 attributes explain nearly 80% of total variation (showing bias concentration), and sensitivity is strongest for semantically aligned judgments (e.g., socioeconomic). The benchmark, code, and dataset are released.

Significance. If the single-attribute isolation is validated, the work offers a fine-grained empirical map of visual cue contributions to MLLM biases, demonstrating concentration in a small attribute set that could guide targeted interventions. The public release of code and dataset is a clear strength for reproducibility and community use.

major comments (1)
  1. [Methods / image generation] Methods / image generation section: The claim that ~15 attributes account for ~80% of variation (and all per-attribute effect sizes) rests on the assumption that each of the ~50 variations per base face alters only the intended cue. The abstract states the design “keeps identity fixed and changes one visual attribute at a time,” yet no quantitative validation (e.g., auxiliary attribute classifiers, human ratings of unchanged features, or metrics on lighting/expression/background stability) is reported to confirm absence of correlated artifacts. This directly affects attribution of measured shifts and the subsequent concentration result.
minor comments (1)
  1. [Abstract] Abstract: exact counts (e.g., precise number of variations and total images) would improve precision over the repeated use of “about.”

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the importance of validating the single-attribute isolation in our image generation process. This is a substantive methodological point that strengthens the paper. We address it directly below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Methods / image generation] Methods / image generation section: The claim that ~15 attributes account for ~80% of variation (and all per-attribute effect sizes) rests on the assumption that each of the ~50 variations per base face alters only the intended cue. The abstract states the design “keeps identity fixed and changes one visual attribute at a time,” yet no quantitative validation (e.g., auxiliary attribute classifiers, human ratings of unchanged features, or metrics on lighting/expression/background stability) is reported to confirm absence of correlated artifacts. This directly affects attribution of measured shifts and the subsequent concentration result.

    Authors: We agree that the absence of explicit quantitative validation is a limitation in the current manuscript. While the generation pipeline was designed via controlled prompting to modify only the target attribute while holding identity, lighting, expression, and background fixed, we did not report auxiliary checks such as attribute classifiers, human ratings, or stability metrics. In the revised version we will add a dedicated validation subsection that reports: (1) results from auxiliary attribute classifiers measuring unintended changes across non-target attributes, (2) human ratings on a sampled subset assessing whether other visual features remained stable, and (3) quantitative metrics on lighting, expression, and background consistency. These additions will directly support the attribution of measured shifts and the concentration finding. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct measurements

full rationale

The paper describes an empirical study that generates controlled image variations, runs MLLM evaluations on binary judgments, and reports observed patterns such as the concentration of variance in ~15 attributes. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on the generated dataset and model outputs rather than reducing to prior inputs by construction. The image-generation assumption is a methodological limitation but does not create circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only abstract available; main domain assumption concerns validity of controlled image generation.

axioms (1)
  • domain assumption Photorealistic image generation can isolate single visual attributes without introducing confounding artifacts
    Invoked by the benchmark design that keeps identity fixed while changing one cue at a time.

pith-pipeline@v0.9.1-grok · 5798 in / 1144 out tokens · 33052 ms · 2026-06-26T17:18:35.905648+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

70 extracted references · 23 canonical work pages

  1. [1]

    , title =

    Fiske, Susan T. , title =. Current Directions in Psychological Science , volume =. 2018 , publisher =. doi:10.1177/0963721417738825 , url =

  2. [2]

    CHI Workshop on Between and Beyond: Designing for Identity Complexity in HCI , year =

    Scheuerman, Morgan Klaus , title =. CHI Workshop on Between and Beyond: Designing for Identity Complexity in HCI , year =

  3. [3]

    Proceedings of the 1st Conference on Fairness, Accountability and Transparency , pages =

    Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification , author =. Proceedings of the 1st Conference on Fairness, Accountability and Transparency , pages =. 2018 , editor =

  4. [4]

    Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

    VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution , author=. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

  5. [5]

    2023 , eprint=

    Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets , author=. 2023 , eprint=

  6. [6]

    Stable Bias: Evaluating Societal Representations in Diffusion Models , url =

    Luccioni, Sasha and Akiki, Christopher and Mitchell, Margaret and Jernite, Yacine , booktitle =. Stable Bias: Evaluating Societal Representations in Diffusion Models , url =

  7. [7]

    2024 , isbn =

    Chinchure, Aditya and Shukla, Pushkar and Bhatt, Gaurav and Salij, Kiri and Hosanagar, Kartik and Sigal, Leonid and Turk, Matthew , title =. 2024 , isbn =. doi:10.1007/978-3-031-72986-7_25 , booktitle =

  8. [8]

    V*: Guided visual search as a core mechanism in multimodal llms

    D'Incà, Moreno and Peruzzo, Elia and Mancini, Massimiliano and Xu, Dejia and Goe, Vidit and Xu, Xingqian and Wang, Zhangyang and Shi, Humphrey and Sebe, Nicu , booktitle=. OpenBias: Open-Set Bias Detection in Text-to-Image Generative Models , year=. doi:10.1109/CVPR52733.2024.01162 , url =

  9. [9]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

    Guimard, Quentin and D'Inc\`a, Moreno and Mancini, Massimiliano and Ricci, Elisa , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2025 , pages =

  10. [10]

    S tereo M ap: Quantifying the Awareness of Human-like Stereotypes in Large Language Models

    Jeoung, Sullam and Ge, Yubin and Diesner, Jana. S tereo M ap: Quantifying the Awareness of Human-like Stereotypes in Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.752

  11. [11]

    ModSCAN : Measuring Stereotypical Bias in Large Vision-Language Models from Vision and Language Modalities

    Jiang, Yukun and Li, Zheng and Shen, Xinyue and Liu, Yugeng and Backes, Michael and Zhang, Yang. ModSCAN : Measuring Stereotypical Bias in Large Vision-Language Models from Vision and Language Modalities. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.713

  12. [12]

    J ob F air: A Framework for Benchmarking Gender Hiring Bias in Large Language Models

    Wang, Ze and Wu, Zekun and Guan, Xin and Thaler, Michael and Koshiyama, Adriano and Lu, Skylar and Beepath, Sachin and Ertekin, Ediz and Perez-Ortiz, Maria , year=. JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models , url=. doi:10.18653/v1/2024.findings-emnlp.184 , booktitle=

  13. [13]

    Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , volume=

    Beauty and the bias: Exploring the impact of attractiveness on multimodal large language models , author=. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , volume=. 2025 , url=

  14. [14]

    International Journal of Innovative Research and Advanced Studies , volume=

    The relationship between clothes and first impressions: Benefits and adverse effects on the individual , author=. International Journal of Innovative Research and Advanced Studies , volume=. 2016 , url=

  15. [15]

    and Montepare, Joann M

    Zebrowitz, Leslie A. and Montepare, Joann M. , title =. Social and Personality Psychology Compass , volume =. 2008 , doi =

  16. [16]

    and Ewan, Ken and Earthy, Jillian and Lefave, Sarah and Goldberg, Heather , title =

    Paunonen, Sampo V. and Ewan, Ken and Earthy, Jillian and Lefave, Sarah and Goldberg, Heather , title =. Journal of Personality , volume =. doi:https://doi.org/10.1111/1467-6494.00065 , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1111/1467-6494.00065 , year =

  17. [17]

    European Psychologist , year=

    The influence of facial piercings and observer personality on perceptions of physical attractiveness and intelligence , author=. European Psychologist , year=

  18. [18]

    Clothing and Textiles Research Journal , year=

    Effect of Garment Style on the Perception of Personal Traits , author=. Clothing and Textiles Research Journal , year=

  19. [19]

    Journal of Computational Social Science , year=

    Psychological trait inferences from women’s clothing: human and machine prediction , author=. Journal of Computational Social Science , year=

  20. [20]

    Social and applied aspects of perceiving faces , pages=

    Determinants and consequences of facial aesthetics , author=. Social and applied aspects of perceiving faces , pages=. 2013 , publisher=

  21. [21]

    Quarterly Journal of Experimental Psychology , volume=

    Internal facial features are signals of personality and health , author=. Quarterly Journal of Experimental Psychology , volume=. 2010 , publisher=

  22. [22]

    GNOSI: An Interdisciplinary Journal of Human Theory and Praxis , volume=

    The discovery and evolution of the big five of personality traits: A historical review , author=. GNOSI: An Interdisciplinary Journal of Human Theory and Praxis , volume=. 2021 , url=

  23. [23]

    Personality and individual differences , volume=

    The Big Five, everyday contexts and activities, and affective experience , author=. Personality and individual differences , volume=. 2019 , publisher=

  24. [24]

    Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , number=

    Identifying implicit social biases in vision-language models , author=. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , number=. 2024 , url =

  25. [25]

    VLS tereo S et: A Study of Stereotypical Bias in Pre-trained Vision-Language Models

    Zhou, Kankan and Lai, Eason and Jiang, Jing. VLS tereo S et: A Study of Stereotypical Bias in Pre-trained Vision-Language Models. Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2022. doi:10.186...

  26. [26]

    Benchmarking cognitive biases in large language models as evaluators

    Koo, Ryan and Lee, Minhwa and Raheja, Vipul and Park, Jong Inn and Kim, Zae Myung and Kang, Dongyeop. Benchmarking Cognitive Biases in Large Language Models as Evaluators. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.29

  27. [27]

    2025 , eprint=

    Qwen3 Technical Report , author=. 2025 , eprint=

  28. [28]

    2024 , eprint=

    Pixtral 12B , author=. 2024 , eprint=

  29. [29]

    2025 , eprint=

    Gemma 3 Technical Report , author=. 2025 , eprint=

  30. [30]

    2024 , url=

    Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae , month=. 2024 , url=

  31. [31]

    and Javaheripi, Mojan and Kauffmann, Piero and Lee, James R

    Abdin, Marah I and Aneja, Jyoti and Behl, Harkirat and Bubeck, Sébastien and Eldan, Ronen and Gunasekar, Suriya and Harrison, Michael and Hewett, Russell J. and Javaheripi, Mojan and Kauffmann, Piero and Lee, James R. and Lee, Yin Tat and Li, Yuanzhi and Liu, Weishung and Mendes, Caio CT and Nguyen, Anh and Price, Eric and de Rosa, Gustavo and Saarikivi, ...

  32. [32]

    2024 , eprint=

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding , author=. 2024 , eprint=

  33. [33]

    2025 , eprint=

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency , author=. 2025 , eprint=

  34. [34]

    Pi, Renjie and Bai, Haoping and Chen, Qibin and Wang, Xiaoming Simon and Shan, Jiulong and Liu, Xiaojiang and Cao, Meng. MR . Judge: Multimodal Reasoner as a Judge. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1021

  35. [35]

    2025 , eprint=

    FairJudge: MLLM Judging for Social Attributes and Prompt Image Alignment , author=. 2025 , eprint=

  36. [36]

    2024 , url=

    Dongping Chen and Ruoxi Chen and Shilin Zhang and Yaochen Wang and Yinuo Liu and Huichi Zhou and Qihui Zhang and Yao Wan and Pan Zhou and Lichao Sun , booktitle=. 2024 , url=

  37. [37]

    Imagen: Text-to-Image Models (including Imagen 4) , year =

  38. [38]

    2025 , eprint=

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author=. 2025 , eprint=

  39. [39]

    Psychological science , volume=

    First impressions: Making up your mind after a 100-ms exposure to a face , author=. Psychological science , volume=. 2006 , publisher=

  40. [40]

    Social Attributions from Faces: Determinants, Consequences, Accuracy, and Functional Significance , volume =

    Todorov, Alexander and Olivola, Christopher and Dotsch, Ron and Mende-Siedlecki, Peter , year =. Social Attributions from Faces: Determinants, Consequences, Accuracy, and Functional Significance , volume =. Annual review of psychology , doi =

  41. [41]

    Proceedings of the National Academy of Sciences , year=

    The functional basis of face evaluation , author=. Proceedings of the National Academy of Sciences , year=

  42. [42]

    Memory and cognition , volume=

    Appearance-based inferences bias source memory , author=. Memory and cognition , volume=. 2012 , publisher=

  43. [43]

    Journal of Fashion Marketing and Management: An International Journal , volume=

    The influence of clothing on first impressions: Rapid and positive responses to minor changes in male attire , author=. Journal of Fashion Marketing and Management: An International Journal , volume=. 2013 , publisher=

  44. [44]

    The Woman Worked as a Babysitter: On Biases in Language Generation

    Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun. The Woman Worked as a Babysitter: On Biases in Language Generation. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1339

  45. [45]

    M., and Bowman, S

    Parrish, Alicia and Chen, Angelica and Nangia, Nikita and Padmakumar, Vishakh and Phang, Jason and Thompson, Jana and Htut, Phu Mon and Bowman, Samuel. BBQ : A hand-built bias benchmark for question answering. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.165

  46. [46]

    Nature Computational Science , year =

    Generative language models exhibit social identity biases , author =. Nature Computational Science , year =. doi:10.1038/s43588-024-00741-1 , note =

  47. [47]

    Mind the Biases: Quantifying Cognitive Biases in Language Model Prompting

    Lin, Ruixi and Ng, Hwee Tou. Mind the Biases: Quantifying Cognitive Biases in Language Model Prompting. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.324

  48. [48]

    Cognitive Biases in Large Language Model based Decision Making: Insights and Mitigation Strategies , volume =

    Chen, Siduo , year =. Cognitive Biases in Large Language Model based Decision Making: Insights and Mitigation Strategies , volume =. Applied and Computational Engineering , doi =

  49. [49]

    2026 , eprint=

    Understanding the Anchoring Effect of LLM with Synthetic Data: Existence, Mechanism, and Potential Mitigations , author=. 2026 , eprint=

  50. [50]

    Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society , year=

    Persistent Anti-Muslim Bias in Large Language Models , author=. Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society , year=

  51. [51]

    Nguyen , keywords =

    Jeremy K. Nguyen , keywords =. Human bias in AI models? Anchoring effects and mitigation strategies in large language models , journal =. 2024 , issn =. doi:https://doi.org/10.1016/j.jbef.2024.100971 , url =

  52. [52]

    2025 , eprint=

    Framing the Game: How Context Shapes LLM Decision-Making , author=. 2025 , eprint=

  53. [53]

    2025 , eprint=

    Confirmation Bias as a Cognitive Resource in LLM-Supported Deliberation , author=. 2025 , eprint=

  54. [54]

    Uncovering Stereotypes in Large Language Models: A Task Complexity-based Approach

    Shrawgi, Hari and Rath, Prasanjit and Singhal, Tushar and Dandapat, Sandipan. Uncovering Stereotypes in Large Language Models: A Task Complexity-based Approach. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.eacl-long.111

  55. [55]

    LLM s Reproduce Stereotypes of Sexual and Gender Minorities

    Ostrow, Ruby and Lopez, Adam. LLM s Reproduce Stereotypes of Sexual and Gender Minorities. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.946

  56. [56]

    Hudson, and Lining Yao

    Lu, Zhuoran and Yin, Ming , title =. 2021 , isbn =. doi:10.1145/3411764.3445562 , booktitle =

  57. [57]

    Judging the judges: A systematic study of position bias in LLM -as-a-judge

    Shi, Lin and Ma, Chiyu and Liang, Wenhua and Diao, Xingjian and Ma, Weicheng and Vosoughi, Soroush. Judging the Judges: A Systematic Study of Position Bias in LLM -as-a-Judge. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguis...

  58. [58]

    European conference on computer vision , pages=

    Mmbench: Is your multi-modal model an all-around player? , author=. European conference on computer vision , pages=. 2024 , organization=

  59. [59]

    Decoding LLM Personality Measurement: Forced-Choice vs

    Li, Xiaoyu and Shi, Haoran and Yu, Zengyi and Tu, Yukun and Zheng, Chanjin. Decoding LLM Personality Measurement: Forced-Choice vs. L ikert. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.480

  60. [60]

    A es B ias B ench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment

    Li, Kun and Po, Lai Man and Yang, Hongzheng and Xu, Xuyuan and Liu, Kangcheng and Zhao, Yuzhi. A es B ias B ench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.386

  61. [61]

    2026 , eprint=

    Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores , author=. 2026 , eprint=

  62. [62]

    Blinded by Context: Unveiling the Halo Effect of MLLM in AI Hiring

    Kim, Kyusik and Ryu, Jeongwoo and Jeon, Hyeonseok and Suh, Bongwon. Blinded by Context: Unveiling the Halo Effect of MLLM in AI Hiring. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1338

  63. [63]

    C o B ia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLM s

    Nikeghbal, Nafiseh and Kargaran, Amir Hossein and Diesner, Jana. C o B ia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLM s. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.84

  64. [64]

    2026 , eprint=

    Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study , author=. 2026 , eprint=

  65. [65]

    2025 , eprint=

    The Bias is in the Details: An Assessment of Cognitive Bias in LLMs , author=. 2025 , eprint=

  66. [66]

    2026 , eprint=

    Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos , author=. 2026 , eprint=

  67. [67]

    2026 , eprint=

    VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models , author=. 2026 , eprint=

  68. [68]

    2025 , eprint=

    Bias Beyond Demographics: Probing Decision Boundaries in Black-Box LVLMs via Counterfactual VQA , author=. 2025 , eprint=

  69. [69]

    2025 , eprint=

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models , author=. 2025 , eprint=

  70. [70]

    2026 , eprint=

    Neuron-Level Interventions for Gendered and Gender-Neutral Generation in Language Models , author=. 2026 , eprint=