Vision Language Models are Biased
Pith reviewed 2026-05-19 12:27 UTC · model grok-4.3
The pith
Vision language models fail at basic counting because memorized knowledge overrides what they see in images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
State-of-the-art VLMs are strongly biased by prior knowledge about popular subjects: they average 17.05% counting accuracy across seven domains spanning animals, logos, chess, board games, optical illusions, and patterned grids. They fail to recognize changes such as a fourth stripe added to a three-stripe Adidas logo. Removing image backgrounds nearly doubles accuracy by reducing the contextual cues that activate memorized responses. Counting accuracy rises with a moderate number of thinking tokens, reaching about 40%, and then declines with excessive reasoning.
What carries the argument
Interference from memorized prior knowledge about common objects and logos that overrides the actual visual input during counting and identification tasks.
Load-bearing premise
The tested counting and identification tasks are purely objective visual problems whose correct answers do not depend on contextual or memorized knowledge.
What would settle it
If a model trained without exposure to internet data on popular logos and objects still failed the same counting tasks, the errors could not be attributed to prior knowledge. High accuracy from such a model would instead support the paper's explanation.
Original abstract
Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but also may notoriously sway their outputs towards wrong or biased answers. In this work, we test how the knowledge about popular subjects hurt the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize the 4th stripe has been added to a 3-stripe Adidas logo) scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains from animals, logos, chess, board games, optical illusions, to patterned grids. Removing image backgrounds nearly doubles accuracy (21.09 percentage points), revealing that contextual visual cues trigger these biased responses. Further analysis of VLMs' reasoning patterns shows that counting accuracy initially rises with thinking tokens, reaching ~40%, before declining with excessive reasoning. Our work presents an interesting failure mode in VLMs and a human-supervised automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.
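The abstract's rise-then-decline pattern for thinking tokens can be checked by binning responses by token count and computing accuracy per bin. The following is a minimal sketch of that idea, not the authors' analysis code: the function name `accuracy_by_token_bin`, the bin width, and every record are hypothetical and invented for illustration.

```python
from collections import defaultdict


def accuracy_by_token_bin(records, bin_width=500):
    """Group (thinking_tokens, is_correct) records into fixed-width bins.

    Returns {bin_start: accuracy}, ordered by bin_start.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for tokens, is_correct in records:
        start = (tokens // bin_width) * bin_width  # left edge of the bin
        totals[start] += 1
        hits[start] += int(is_correct)
    return {start: hits[start] / totals[start] for start in sorted(totals)}


# Hypothetical records (token count, answer correct?), not data from the paper.
records = [
    (120, False), (300, False), (450, True), (480, False),      # bin 0
    (600, True), (750, False), (900, True), (980, False),       # bin 500
    (1100, False), (1300, False), (1400, True), (1450, False),  # bin 1000
]

curve = accuracy_by_token_bin(records)
peak_bin = max(curve, key=curve.get)  # bin with the highest accuracy
print(curve)
print(peak_bin)
```

With real model outputs, a peak in a middle bin followed by lower accuracy in later bins would match the reported pattern. Bin width is a free choice and can change where the peak appears, so it is worth varying.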
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that state-of-the-art vision-language models exhibit strong prior-knowledge bias on objective visual tasks of counting and identification, achieving only 17.05% average accuracy across seven domains (animals, logos, chess, board games, optical illusions, patterned grids). It reports that removing image backgrounds raises accuracy by 21.09 percentage points and that accuracy rises with thinking tokens to ~40% before declining with excessive reasoning. The work also presents a human-supervised automated framework for testing such biases, with code and data released.
Significance. If the reported effects are shown to be robust, the result would identify a practically important limitation: memorized knowledge can override basic visual perception in VLMs, with direct consequences for applications that rely on accurate counting or identification. The release of code and data is a clear strength that enables independent verification and extension.
major comments (3)
- [Abstract] Abstract: the manuscript states concrete accuracy figures (17.05% counting accuracy, +21.09 pp background-removal gain) and a causal interpretation (contextual cues trigger bias) yet supplies no description of image synthesis, modification visibility, prompt templates, model list, dataset sizes, or statistical tests. Without these details the quantitative claims cannot be evaluated for measurement validity or post-hoc selection.
- [Abstract] Abstract: the central claim that poor performance is caused by prior-knowledge bias presupposes that the tested counting problems (e.g., detecting an added stripe on an Adidas-style logo) possess unambiguous visual ground truth independent of memorized knowledge. No controls comparing popular versus neutral patterns, no verification of image quality, and no comparison to general VLM counting limits are described, leaving alternative explanations unaddressed.
- [Abstract] Abstract: the background-removal result is offered as evidence that contextual cues trigger bias, but the abstract provides neither per-domain numbers nor any analysis confirming that the improvement isolates memorized-knowledge effects rather than simply reducing visual clutter.
minor comments (1)
- [Abstract] Abstract: the seven domains are enumerated but no concrete examples or construction details are given, making it difficult to judge how representative or controlled the test cases are.
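One way to separate prior-knowledge bias from general counting weakness, as the second major comment asks, is to measure how many errors land exactly on the memorized count. The sketch below assumes each trial records the model's answer, the true count in the modified image, and the canonical count for the unmodified subject. The field names, the `summarize` helper, and the trial values are hypothetical and not taken from the paper.

```python
def summarize(trials):
    """Return accuracy and the share of errors equal to the canonical count.

    Each trial is a dict with integer keys:
      'predicted' - the model's answer
      'truth'     - the count actually shown in the modified image
      'canonical' - the memorized count for the unmodified subject
    """
    n = len(trials)
    correct = sum(t["predicted"] == t["truth"] for t in trials)
    errors = [t for t in trials if t["predicted"] != t["truth"]]
    bias_aligned = sum(t["predicted"] == t["canonical"] for t in errors)
    return {
        "accuracy": correct / n,
        "bias_aligned_error_share": bias_aligned / len(errors) if errors else 0.0,
    }


# Hypothetical trials for a four-stripe variant of a three-stripe logo.
trials = [
    {"predicted": 3, "truth": 4, "canonical": 3},  # error, matches the prior
    {"predicted": 3, "truth": 4, "canonical": 3},  # error, matches the prior
    {"predicted": 4, "truth": 4, "canonical": 3},  # correct
    {"predicted": 5, "truth": 4, "canonical": 3},  # error, unrelated to the prior
]

summary = summarize(trials)
print(summary)
```

A high bias-aligned share points toward memorized answers, while errors scattered around the true count would point toward a general counting limit. Running the same summary on neutral patterns with no canonical count would supply the control the referee requests.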
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We address each major point below and have revised the manuscript to strengthen the presentation of our claims and methods.
Point-by-point responses
- Referee: [Abstract] Abstract: the manuscript states concrete accuracy figures (17.05% counting accuracy, +21.09 pp background-removal gain) and a causal interpretation (contextual cues trigger bias) yet supplies no description of image synthesis, modification visibility, prompt templates, model list, dataset sizes, or statistical tests. Without these details the quantitative claims cannot be evaluated for measurement validity or post-hoc selection.
  Authors: We agree the abstract is concise and omits these specifics to meet length requirements. The full manuscript details the image synthesis process, visible modifications, prompt templates, evaluated models, dataset sizes, and statistical tests in the Methods and Experiments sections. We have revised the abstract to add a brief summary of the experimental protocol and refer readers to the paper for complete information.
  revision: yes
- Referee: [Abstract] Abstract: the central claim that poor performance is caused by prior-knowledge bias presupposes that the tested counting problems (e.g., detecting an added stripe on an Adidas-style logo) possess unambiguous visual ground truth independent of memorized knowledge. No controls comparing popular versus neutral patterns, no verification of image quality, and no comparison to general VLM counting limits are described, leaving alternative explanations unaddressed.
  Authors: The tasks use explicit, objective modifications (e.g., adding a visible fourth stripe to a three-stripe logo) that establish clear visual ground truth independent of prior knowledge, as described in the image generation process. We acknowledge that the abstract does not explicitly discuss controls or baselines. In the revision we have added text clarifying image quality verification through human review and included a comparison to general VLM counting performance on non-biased images to address alternative explanations.
  revision: yes
- Referee: [Abstract] Abstract: the background-removal result is offered as evidence that contextual cues trigger bias, but the abstract provides neither per-domain numbers nor any analysis confirming that the improvement isolates memorized-knowledge effects rather than simply reducing visual clutter.
  Authors: The background-removal result is presented as support for contextual cues triggering bias, with the average gain reported in the abstract. The full paper contains per-domain breakdowns and analysis linking improvements to domains with strong prior associations rather than generic clutter reduction. We have partially revised the abstract to note the consistency across domains and briefly clarify the isolating nature of the experiment.
  revision: partial
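The background-removal claim mixes a relative statement ("nearly doubles") with an absolute one (21.09 percentage points), so a per-domain breakdown is clearest when it reports both. The sketch below shows that bookkeeping under stated assumptions: the `gain` helper, the domain labels, and the accuracy values are hypothetical placeholders, not the paper's per-domain results.

```python
def gain(before, after):
    """Return (percentage-point gain, ratio of after to before)."""
    return after - before, after / before


# Hypothetical accuracies in percent (with background, background removed).
per_domain = {
    "logos": (20.0, 45.0),
    "patterned grids": (30.0, 36.0),
}

report = {name: gain(before, after) for name, (before, after) in per_domain.items()}
for name, (pp, ratio) in report.items():
    print(f"{name}: +{pp:.2f} pp, x{ratio:.2f}")
```

If the gains concentrate in domains with strong prior associations and stay small in neutral ones, that would favor the authors' reading over generic clutter reduction. Roughly uniform gains across all domains would favor the referee's alternative.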
Circularity Check
Empirical measurement study with no derivation chain
The paper reports direct empirical accuracy measurements (17.05% average counting accuracy, +21.09 pp after background removal) on VLM tasks across domains. No equations, derivations, fitted parameters, or predictive models are present in the abstract or described framework. Results are observational test outcomes rather than quantities reduced by construction from self-citations, ansatzes, or prior author results. The work is self-contained as a measurement study against external benchmarks (VLM outputs on provided images), with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VLMs inherit and are swayed by memorized prior knowledge from the Internet on downstream visual tasks
Forward citations
Cited by 3 Pith papers
- Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images
  An adapted Cellpose-SAM pipeline achieves 1.50% MAPE on ASTM grain size number G using only two training images while maintaining topological separation better than U-Net, MatSAM, or Qwen2.5-VL-7B.
- Watch Before You Answer: Learning from Visually Grounded Post-Training
  Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.
- Seed1.8 Model Card: Towards Generalized Real-World Agency
  Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.