Vision Language Models are Biased

Anh Totti Nguyen; An Vo; Daeyoung Kim; Khai-Nguyen Nguyen; Mohammad Reza Taesiri; Vy Tuong Dang

arXiv: 2505.23941 · v4 · submitted 2025-05-29 · cs.LG · cs.CV


An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, Daeyoung Kim

Pith reviewed 2026-05-19 12:27 UTC · model grok-4.3


keywords: vision language models, model bias, counting tasks, prior knowledge, visual perception, multimodal models, object identification, reasoning patterns


The pith

Vision language models fail at basic counting because memorized knowledge overrides what they see in images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that state-of-the-art vision language models carry strong prior knowledge about popular subjects from their training data. This knowledge interferes with objective visual tasks such as counting stripes on logos or elements in grids and games. Models reach only about 17 percent counting accuracy on average across seven domains; for example, they miss a fourth stripe added to an Adidas-style logo. Removing image backgrounds raises accuracy by more than 20 percentage points, which suggests that familiar context triggers the wrong answers. The work also finds that accuracy rises with the number of thinking tokens, peaks at about 40 percent, and then declines as reasoning grows longer.

Core claim

State-of-the-art VLMs are strongly biased by prior knowledge about popular subjects, scoring an average of 17.05% accuracy in counting across seven domains from animals, logos, chess, board games, optical illusions to patterned grids. They fail to recognize changes such as an added fourth stripe on a three-stripe Adidas logo. Removing image backgrounds nearly doubles accuracy by reducing contextual cues that activate memorized responses, while counting accuracy rises with moderate thinking tokens before declining with excessive reasoning.
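One way to make "biased by prior knowledge" measurable is to split wrong answers into those that match the memorized count and those that do not. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `score_counting`, the record layout, and the sample responses are all hypothetical.

```python
from collections import Counter

def score_counting(records):
    """Score (predicted, true_count, prior_count) integer triples.

    prior_count is the count a model would recall from memory
    (e.g. 3 stripes for a well-known three-stripe logo).
    """
    tally = Counter()
    for predicted, true_count, prior_count in records:
        if predicted == true_count:
            tally["correct"] += 1
        elif predicted == prior_count:
            tally["bias_aligned"] += 1  # wrong, and equal to the memorized count
        else:
            tally["other_error"] += 1   # wrong for some other reason
    total = sum(tally.values())
    if total == 0:
        raise ValueError("no records to score")
    return {key: tally[key] / total
            for key in ("correct", "bias_aligned", "other_error")}

# Hypothetical responses to a four-stripe variant of a three-stripe logo.
records = [(3, 4, 3), (3, 4, 3), (4, 4, 3), (5, 4, 3)]
rates = score_counting(records)
print(rates)
```

A high `bias_aligned` share relative to `other_error` would point toward memorized answers rather than a general inability to count, which is the distinction the core claim depends on.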

What carries the argument

Interference from memorized prior knowledge about common objects and logos that overrides the actual visual input during counting and identification tasks.

Load-bearing premise

The tested counting and identification tasks are purely objective visual problems whose correct answers do not depend on contextual or memorized knowledge.

What would settle it

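A model trained without exposure to internet data on popular logos and objects would settle it either way: high accuracy on the same counting tasks would support the claim that prior knowledge causes the failures, while equally low accuracy would show the failures are not caused by prior knowledge.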

Original abstract

Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but also may notoriously sway their outputs towards wrong or biased answers. In this work, we test how the knowledge about popular subjects hurt the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize the 4th stripe has been added to a 3-stripe Adidas logo) scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains from animals, logos, chess, board games, optical illusions, to patterned grids. Removing image backgrounds nearly doubles accuracy (21.09 percentage points), revealing that contextual visual cues trigger these biased responses. Further analysis of VLMs' reasoning patterns shows that counting accuracy initially rises with thinking tokens, reaching ~40%, before declining with excessive reasoning. Our work presents an interesting failure mode in VLMs and a human-supervised automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.
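The abstract reports the background-removal gain in percentage points (21.09) and describes it as "nearly doubles", but this page does not state the baseline accuracy for that experiment. The sketch below only works through the arithmetic that relates the two statements. The helper `after_gain` is hypothetical, and using the 17.05% overall average as the baseline is an assumption made for illustration.

```python
def after_gain(baseline_pct, gain_pp):
    """Return (new accuracy in percent, new / baseline ratio)
    for an absolute gain given in percentage points."""
    if baseline_pct <= 0:
        raise ValueError("baseline must be positive")
    new_pct = baseline_pct + gain_pp
    return new_pct, new_pct / baseline_pct

GAIN_PP = 21.09  # gain reported in the abstract, in percentage points

# Assumption: treat the 17.05% overall average as the baseline.
new_pct, ratio = after_gain(17.05, GAIN_PP)
print(f"{new_pct:.2f}% accuracy, {ratio:.2f}x the baseline")

# Accuracy exactly doubles only when the baseline equals the gain.
_, ratio_at_gain = after_gain(GAIN_PP, GAIN_PP)
print(f"baseline {GAIN_PP}% -> {ratio_at_gain:.2f}x")
```

Under that assumed baseline the ratio comes out above 2, so "nearly doubles" presumably refers to a subset or condition whose baseline sits somewhat above 21.09%. The paper itself is the place to check which baseline applies.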

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that state-of-the-art vision-language models exhibit strong prior-knowledge bias on objective visual tasks of counting and identification, achieving only 17.05% average accuracy across seven domains (including animals, logos, chess, board games, optical illusions, and patterned grids). It reports that removing image backgrounds raises accuracy by 21.09 percentage points and that accuracy rises with thinking tokens to ~40% before declining with excessive reasoning. The work also presents a human-supervised automated framework for testing such biases, with code and data released.

Significance. If the reported effects are shown to be robust, the result would identify a practically important limitation: memorized knowledge can override basic visual perception in VLMs, with direct consequences for applications that rely on accurate counting or identification. The release of code and data is a clear strength that enables independent verification and extension.

major comments (3)

[Abstract] Abstract: the manuscript states concrete accuracy figures (17.05% counting accuracy, +21.09 pp background-removal gain) and a causal interpretation (contextual cues trigger bias) yet supplies no description of image synthesis, modification visibility, prompt templates, model list, dataset sizes, or statistical tests. Without these details the quantitative claims cannot be evaluated for measurement validity or post-hoc selection.
[Abstract] Abstract: the central claim that poor performance is caused by prior-knowledge bias presupposes that the tested counting problems (e.g., detecting an added stripe on an Adidas-style logo) possess unambiguous visual ground truth independent of memorized knowledge. No controls comparing popular versus neutral patterns, no verification of image quality, and no comparison to general VLM counting limits are described, leaving alternative explanations unaddressed.
[Abstract] Abstract: the background-removal result is offered as evidence that contextual cues trigger bias, but the abstract provides neither per-domain numbers nor any analysis confirming that the improvement isolates memorized-knowledge effects rather than simply reducing visual clutter.
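The first comment asks for dataset sizes and statistical tests alongside the headline accuracy. As an illustration of what such reporting could look like, the sketch below computes a Wilson score interval for a single accuracy figure. It is a generic technique, not something the source says the authors used, and the sample size of 100 is hypothetical because this page does not report dataset sizes.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical: 17 correct answers out of 100 questions.
low, high = wilson_interval(17, 100)
print(f"accuracy 17.0%, 95% interval [{low:.1%}, {high:.1%}]")
```

The same interval computed separately for popular and neutral patterns would give a first read on whether an accuracy gap between them exceeds sampling noise.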

minor comments (1)

[Abstract] Abstract: the seven domains are enumerated but no concrete examples or construction details are given, making it difficult to judge how representative or controlled the test cases are.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each major point below and have revised the manuscript to strengthen the presentation of our claims and methods.


Referee: [Abstract] Abstract: the manuscript states concrete accuracy figures (17.05% counting accuracy, +21.09 pp background-removal gain) and a causal interpretation (contextual cues trigger bias) yet supplies no description of image synthesis, modification visibility, prompt templates, model list, dataset sizes, or statistical tests. Without these details the quantitative claims cannot be evaluated for measurement validity or post-hoc selection.

Authors: We agree the abstract is concise and omits these specifics to meet length requirements. The full manuscript details the image synthesis process, visible modifications, prompt templates, evaluated models, dataset sizes, and statistical tests in the Methods and Experiments sections. We have revised the abstract to add a brief summary of the experimental protocol and refer readers to the paper for complete information.

Revision: yes

Referee: [Abstract] Abstract: the central claim that poor performance is caused by prior-knowledge bias presupposes that the tested counting problems (e.g., detecting an added stripe on an Adidas-style logo) possess unambiguous visual ground truth independent of memorized knowledge. No controls comparing popular versus neutral patterns, no verification of image quality, and no comparison to general VLM counting limits are described, leaving alternative explanations unaddressed.

Authors: The tasks use explicit, objective modifications (e.g., adding a visible fourth stripe to a three-stripe logo) that establish clear visual ground truth independent of prior knowledge, as described in the image generation process. We acknowledge that the abstract does not explicitly discuss controls or baselines. In the revision we have added text clarifying image quality verification through human review and included a comparison to general VLM counting performance on non-biased images to address alternative explanations.

Revision: yes

Referee: [Abstract] Abstract: the background-removal result is offered as evidence that contextual cues trigger bias, but the abstract provides neither per-domain numbers nor any analysis confirming that the improvement isolates memorized-knowledge effects rather than simply reducing visual clutter.

Authors: The background-removal result is presented as support for contextual cues triggering bias, with the average gain reported in the abstract. The full paper contains per-domain breakdowns and analysis linking improvements to domains with strong prior associations rather than generic clutter reduction. We have partially revised the abstract to note the consistency across domains and briefly clarify the isolating nature of the experiment.

Revision: partial

Circularity Check

0 steps flagged

Empirical measurement study with no derivation chain


The paper reports direct empirical accuracy measurements (17.05% average counting accuracy, +21.09 pp after background removal) on VLM tasks across domains. No equations, derivations, fitted parameters, or predictive models are present in the abstract or described framework. Results are observational test outcomes rather than quantities reduced by construction from self-citations, ansatzes, or prior author results. The work is self-contained as a measurement study against external benchmarks (VLM outputs on provided images), with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical bias audit of existing models; the abstract introduces no fitted parameters, new mathematical axioms, or postulated entities.

axioms (1)

Domain assumption: VLMs inherit and are swayed by memorized prior knowledge from the Internet on downstream visual tasks. The abstract states this as the mechanism that hurts accuracy on objective counting and identification.



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images
cs.CV · 2026-04 · unverdicted · novelty 6.0
An adapted Cellpose-SAM pipeline achieves 1.50% MAPE on ASTM grain size number G using only two training images while maintaining topological separation better than U-Net, MatSAM, or Qwen2.5-VL-7B.

Watch Before You Answer: Learning from Visually Grounded Post-Training
cs.CV · 2026-04 · unverdicted · novelty 6.0
Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.

Seed1.8 Model Card: Towards Generalized Real-World Agency
cs.AI · 2026-03 · unverdicted · novelty 5.0
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.