Vision-Language Models Suppress Female Representations Under Ambiguous Input

Arnau Marin-Llobet; Mahzarin R. Banaji; Simon Henniger

arxiv: 2605.31556 · v1 · pith:EYIML2BCnew · submitted 2026-05-29 · 💻 cs.CV · cs.AI· cs.CL· cs.CY· cs.HC

Vision-Language Models Suppress Female Representations Under Ambiguous Input

Arnau Marin-Llobet , Simon Henniger , Mahzarin R. Banaji This is my paper

Pith reviewed 2026-06-28 22:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CLcs.CYcs.HC

keywords vision-language modelsgender biasambiguous inputslatent associationsoccupational stereotypesinternal representationsbias suppression

0 comments

The pith

Vision-language models internally encode female associations for ambiguous images but suppress them before outputting male labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that alignment in vision-language models reduces overt bias on clear images yet leaves a deeper decoupling intact under gender-ambiguous inputs common in real scenes. Using a new zero-shot metric, the authors track how visual-token activations associate with gender concepts layer by layer and find that female signals rise in middle layers then drop sharply before the final output while male signals strengthen throughout. This pattern holds across four models, fifteen occupations, and hundreds of back-view or gear-obscured images, showing that surface-level male defaults do not match the internal representations the models actually compute. A color ablation further shows that culturally loaded visual details can shift these internal associations even when the core figure remains ambiguous.

Core claim

Across fifteen occupations, over eight hundred gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter in which male signal amplifies end-to-end while female signal peaks mid-network and is suppressed before generation.

What carries the argument

LALS (Latent Association Leaning Score), a metric that projects visual-token activations into the model's text-embedding space to quantify concept associations at each token and layer.

If this is right

Current alignment techniques that succeed on unambiguous inputs leave internal female representations vulnerable to suppression under uncertainty.
Interventions aimed at bias must target the mid-to-late layers where female signals are filtered rather than only the final output head.
Visual cues such as clothing color can modulate the strength of internal gender associations even when the figure itself remains ambiguous.
Models may default to male outputs precisely because the training process amplifies one signal while attenuating the other across depth.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed suppression could be tested by training a model variant that forces female signals to propagate unchanged to the output and checking whether male defaults disappear on the same ambiguous set.
If the pattern generalizes beyond gender, similar layer-wise filtering may affect other ambiguous attributes such as age or ethnicity in the same models.
Practical deployment in robotics or surveillance would need explicit uncertainty handling at the layer where suppression begins rather than post-hoc output correction.

Load-bearing premise

The projection step in LALS accurately captures genuine semantic associations inside the model rather than artifacts introduced by the embedding alignment itself.

What would settle it

Recomputing the layer-wise associations on the same images but with a different projection method or a held-out text embedding space yields the opposite pattern of female suppression.

Figures

Figures reproduced from arXiv: 2605.31556 by Arnau Marin-Llobet, Mahzarin R. Banaji, Simon Henniger.

**Figure 1.** Figure 1: Representative Summary of Findings. Top: when gender is visually clear, VLMs report it accurately. Bottom: when the image is gender-ambiguous (faceless figures, same occupations), models default to male under forced-choice prompting, even for femalestereotyped roles. straightforward: show the model an image, ask it a question, and check whether the output reflects stereotypical or harmful associations. I… view at source ↗

**Figure 2.** Figure 2: Representative ambiguous-gender images. Each shows a faceless figure in an occupation-specific setting [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: illustrates a kitchen scene under all four conditions. With no person present, the heatmap is nearly flat and the net LALS hovers near zero. Adding a man produces a clear male-leaning (blue) cluster localized on the person; adding a woman produces the opposite female-leaning (red) pattern in the corresponding region. When both are present, LALS correctly assigns male and female signal to the respective in… view at source ↗

**Figure 4.** Figure 4: Chain-of-thought reveals the male default (Qwen2-VL-7B-Instruct). Models are asked to list visual cues before committing to a guess (prompt in App. A.1). For both male-stereotyped (top) and femalestereotyped (bottom) occupations, the model outputs male. For the florist, it explicitly acknowledges the female stereotype yet still guesses male. 4.3 Layer Dynamics Reveal Asymmetric Filtering The decoupling … view at source ↗

**Figure 5.** Figure 5: Normalised LALS across network depth, grouped by regime (mean ± s.e.m.; shaded band: neutral zone |LALS| < 15%). Left: agreement (female) — female-leaning internally and in output. Middle: divergence — female-leaning internally but output as male; signal peaks mid-network and collapses at the final layer. Right: agreement (male) — signal preserved end-to-end. Per-model layer sweeps in [PITH_FULL_IMAGE:fig… view at source ↗

**Figure 6.** Figure 6: Color ablation (Qwen2-VL, layer 8). Left: example images of construction workers (top) and nurses (bottom) differing only in clothing color. Middle: per-image normalised LALS per color condition (diamonds = means; dots = individual images). Right: dose-response for nurse scrubs across seven colors ordered by perceived femininity, showing change in LALS at the peak layer relative to gray (mean ± s.e.m.; lin… view at source ↗

**Figure 7.** Figure 7: validates the choice of top-5% aggregation used throughout the paper. We compute LALS on a held-out visible-gender set (Qwen2-VL) and measure two metrics as a function of the top-% of tokens aggregated by |LALS|: ROC-AUC for predicting visible gender, and sign accuracy (whether the imagelevel LALS matches the true gender). Both metrics peak between 5–7% and degrade as low-magnitude tokens dilute the signa… view at source ↗

**Figure 8.** Figure 8: Construction-site replication. The empty scene is neutral; inserting a man shifts the signal toward male (blue) and inserting a woman shifts it toward female (red), confirming that the kitchen-scene result generalizes across scene types. B.2 Real Photographs vs. Synthetic Images A natural concern is that our findings may be specific to AI-generated images [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Real vs. synthetic images (Qwen2-VL, N=10/condition; mean ± s.e.m.). Layer-wise LALS on real photographs follows the same trajectory as on synthetic images. Shaded band: neutral zone. C Extended Layer Analyses C.1 Per-Architecture Layer Sweep [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

**Figure 10.** Figure 10: Per-architecture layer sweep. Normalised LALS across layers for 15 occupations and four VLM architectures (N=25 images per occupation). Each line is one occupation; blue = male-leaning, red = femaleleaning. C.2 Instruct vs. Base Model To test whether the late-layer suppression of female signal is introduced by instruction tuning, we run the same LALS layer sweep on the Qwen2-VL-7B base checkpoint (no RLH… view at source ↗

**Figure 11.** Figure 11: Instruct vs. base model comparison (Qwen2-VL-7B, N=25 images per occupation; mean ± s.e.m.). Normalised LALS across network depth for the instruction-tuned model (left) and the pre-RLHF base model (right). The asymmetric filtering pattern is present in both variants; instruction tuning amplifies but does not introduce it. D Causal Intervention: Is the Mid-Layer Signal Necessary? The layer sweeps in Sectio… view at source ↗

**Figure 12.** Figure 12: Causal intervention at layer 16 (Qwen2-VL-7B-Instruct, N=20 images per occupation). We project a single gender direction out of the visual-token activations at layer 16 during the forward pass (full ablation, α=1) and re-run the model. Left: the mid-layer LALS signal collapses for female-leaning occupations. Right: the forced-choice female rate drops in lockstep, while male-default occupations are unaffec… view at source ↗

read the original abstract

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in full gear, a figure seen from behind) cases common in practice yet rarely studied. We find that minimal prompting pressure exposes occupation-gender defaults when prompting ambiguous input images, with models collapsing to male even for strongly female-stereotyped occupations. But do these outputs reflect what models actually encode internally? We introduce LALS (Latent Association Leaning Score), a zero-shot metric that projects visual-token activations into the model's text-embedding space to measure concept associations per token and layer. Across 15 occupations, over 800 gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter -- male signal amplifies end-to-end while female signal peaks mid-network and is suppressed before generation -- and a color ablation shows that culturally loaded visual cues such as clothing color further modulate these internal associations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reports internal female encoding but male outputs on ambiguous images across VLMs using a new LALS projection metric, with the projection step as the main point to check.

read the letter

The key takeaway is that these models encode female associations internally for gender-ambiguous occupation images yet produce male outputs, with layer analysis showing male signals strengthening end-to-end while female signals peak mid-network and drop off. They introduce LALS as a zero-shot way to track this by projecting visual-token activations into text-embedding space.

The work runs controlled tests on four VLMs, 15 occupations, and over 800 images, plus a color ablation that shows clothing cues shifting the internal associations. That setup gives a clear picture of the decoupling and the asymmetric filtering, which is new relative to prior bias studies that focus on clear inputs or final outputs.

The soft spot sits in the LALS step itself. Treating cosine similarity after projection as a direct measure of concept association assumes the alignment between visual activations and text space preserves the relevant semantics rather than picking up residual geometry from training. The abstract gives no validation on unambiguous cases or comparison to logit probes, so the layer-wise suppression claim rests on that untested assumption. If the projection introduces artifacts, the internal-output mismatch becomes harder to interpret.

This is for people auditing fairness in deployed VLMs or studying how alignment affects internal representations. The empirical scope and ablations make it worth a serious referee even with the metric questions, mainly to get the authors to add checks on LALS and report the full statistics.

Recommendation: send it to review but flag the projection validation as a required addition.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that vision-language models exhibit a systematic decoupling between internal representations and generated outputs when processing gender-ambiguous images across 15 occupations: models often encode female associations internally (as measured by the introduced LALS metric) yet default to male outputs. Layer-wise analysis identifies an asymmetric filter in which male signals amplify end-to-end while female signals peak mid-network and are suppressed prior to generation; a color ablation further shows modulation by culturally loaded visual cues.

Significance. If the LALS metric is shown to capture genuine semantic associations rather than projection artifacts, the results would demonstrate a previously understudied internal-output mismatch in VLMs under realistic ambiguity, with direct relevance to bias auditing and mitigation. The zero-shot, layer-resolved probing approach constitutes a methodological contribution that could be applied more broadly, provided it receives independent validation.

major comments (2)

[LALS definition and validation] LALS projection step (methods describing the metric): the claim that cosine similarity after projecting visual-token activations into text-embedding space measures genuine concept associations per layer and token is load-bearing for the decoupling and asymmetric-filter results, yet the manuscript provides no validation experiments (e.g., recovery of expected gender associations on unambiguous images or correlation with logit-level probes) to rule out embedding-geometry artifacts.
[Experimental details and results] Experimental protocol (abstract and § on image construction): the central claim rests on results across >800 gender-ambiguous images, but the manuscript supplies no statistical details, error bars, exact construction protocol for the ambiguous images, or per-occupation trial counts, preventing assessment of whether the reported male-output bias and mid-network female peak are robust.

minor comments (1)

[Figures] Figure captions and layer-wise plots would benefit from explicit labeling of token positions and confidence intervals to improve readability of the suppression pattern.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for strengthening the manuscript. We address each major point below and will revise accordingly.

read point-by-point responses

Referee: [LALS definition and validation] LALS projection step (methods describing the metric): the claim that cosine similarity after projecting visual-token activations into text-embedding space measures genuine concept associations per layer and token is load-bearing for the decoupling and asymmetric-filter results, yet the manuscript provides no validation experiments (e.g., recovery of expected gender associations on unambiguous images or correlation with logit-level probes) to rule out embedding-geometry artifacts.

Authors: We agree that explicit validation of LALS would strengthen the interpretation of the metric. The current results rely on cross-model and cross-occupation consistency, but this does not fully substitute for targeted checks. We will add validation experiments in the revised manuscript, including tests on unambiguous images to recover expected gender associations and comparisons against logit-level probes. revision: yes
Referee: [Experimental details and results] Experimental protocol (abstract and § on image construction): the central claim rests on results across >800 gender-ambiguous images, but the manuscript supplies no statistical details, error bars, exact construction protocol for the ambiguous images, or per-occupation trial counts, preventing assessment of whether the reported male-output bias and mid-network female peak are robust.

Authors: We concur that additional experimental details are required for proper evaluation of robustness. The revision will include the precise image construction protocol, per-occupation trial counts, statistical measures (e.g., standard errors), and error bars on all relevant plots and tables. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical metric application on external image set

full rationale

The paper introduces LALS as an explicitly defined projection-based metric and applies it to a held-out collection of 800+ ambiguous images across 15 occupations and four VLMs. No derivation step equates a reported association or suppression pattern to a fitted parameter, self-referential definition, or prior self-citation chain; the decoupling claim is obtained by direct measurement rather than by algebraic identity or renaming of inputs. The work therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; full methods, equations, and experimental details unavailable. The primary invented element is the LALS metric itself. The central domain assumption is that projecting visual activations into text space yields faithful association measurements.

axioms (1)

domain assumption Projecting visual-token activations into the model's text-embedding space produces a valid measure of per-token and per-layer concept associations
This premise underpins the entire LALS metric and the claim of internal female encoding; it is invoked when the abstract introduces the zero-shot projection technique.

invented entities (1)

LALS (Latent Association Leaning Score) no independent evidence
purpose: Zero-shot metric that projects visual-token activations into text-embedding space to quantify concept associations at each token and layer
Newly defined in the paper to enable the internal analysis; no independent evidence outside this work is described.

pith-pipeline@v0.9.1-grok · 5736 in / 1518 out tokens · 34329 ms · 2026-06-28T22:41:08.497296+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Semantics derived automatically from lan- guage corpora contain human-like biases.Science, 356(6334):183–186. Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, and 1 others. 2024. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling....

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

InProceedings of the 2021 Conference on Empiri- cal Methods in Natural Language Processing, pages 1968–1994

Harms of gender exclusivity and challenges in non-binary representation in language technologies. InProceedings of the 2021 Conference on Empiri- cal Methods in Natural Language Processing, pages 1968–1994. Kathleen C Fraser and Svetlana Kiritchenko. 2024. Examining gender and racial bias in large vision– language models using a novel dataset of parallel ...

2021
[3]

Hila Gonen and Yoav Goldberg

Bias and fairness in large language models: A survey.Computational linguistics, 50(3):1097–1179. Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computa- tional Ling...

work page arXiv 2019
[4]

arXiv preprint arXiv:2004.12265 , year=

Non-archival. Chandler May, Alex Wang, Shikha Bordia, Samuel Bow- man, and Rachel Rudinger. 2019. On measuring so- cial biases in sentence encoders. InProceedings of the 2019 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 622–628. Kevin Meng,...

work page arXiv 2019

[1] [1]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Semantics derived automatically from lan- guage corpora contain human-like biases.Science, 356(6334):183–186. Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, and 1 others. 2024. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling....

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

InProceedings of the 2021 Conference on Empiri- cal Methods in Natural Language Processing, pages 1968–1994

Harms of gender exclusivity and challenges in non-binary representation in language technologies. InProceedings of the 2021 Conference on Empiri- cal Methods in Natural Language Processing, pages 1968–1994. Kathleen C Fraser and Svetlana Kiritchenko. 2024. Examining gender and racial bias in large vision– language models using a novel dataset of parallel ...

2021

[3] [3]

Hila Gonen and Yoav Goldberg

Bias and fairness in large language models: A survey.Computational linguistics, 50(3):1097–1179. Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computa- tional Ling...

work page arXiv 2019

[4] [4]

arXiv preprint arXiv:2004.12265 , year=

Non-archival. Chandler May, Alex Wang, Shikha Bordia, Samuel Bow- man, and Rachel Rudinger. 2019. On measuring so- cial biases in sentence encoders. InProceedings of the 2019 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 622–628. Kevin Meng,...

work page arXiv 2019