Benchmarking Attribute Discrimination in Infant-Scale Vision-Language Models

2025 · cs.LG · arXiv:2512.18951


abstract

Infants learn not only object categories but also fine-grained visual attributes such as color, size, and texture from limited experience. Prior infant-scale vision-language models have mainly been evaluated on object recognition, leaving open whether they support within-class attribute discrimination. We introduce a controlled benchmark that varies color, size, and texture across 67 everyday object classes using synthetic rendering to decouple attribute values from object identity. We evaluate infant-trained models (CVCL and an infant-trained DINO baseline) against web-scale and ImageNet models (CLIP, SigLIP, ResNeXt) under two complementary settings: an image-only prototype test and a text-vision test with attribute-object prompts. We find a dissociation between visual and linguistic attribute information: infant-trained models form strong visual representations for size and discriminate texture comparably to other models, but perform poorly on visual color discrimination, and in the text-vision setting they struggle to ground color and show only modest size grounding. In contrast, web-trained vision-language models strongly ground color from text while exhibiting weaker visual size discrimination.
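The two evaluation settings named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how scores are computed, so everything here is an assumption: prototypes are taken as the mean embedding per attribute value, matching uses cosine similarity, and the 3-dimensional embeddings, prompt strings, and function names (`build_prototypes`, `nearest_label`) are hypothetical toy values rather than the paper's data or code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_prototypes(support):
    """Map each attribute value to the mean of its support embeddings.

    Assumption: a prototype is a simple mean; the paper may differ.
    """
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def nearest_label(query, candidates):
    """Return the candidate label whose vector is most similar to the query."""
    return max(candidates, key=lambda label: cosine(query, candidates[label]))


# Toy image embeddings for two color values (made-up numbers).
support = {
    "red": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "blue": [[0.0, 0.1, 1.0], [0.1, 0.2, 0.9]],
}
query_image = [0.8, 0.1, 0.2]

# Image-only setting: compare the query image to visual prototypes.
prototypes = build_prototypes(support)
image_only_prediction = nearest_label(query_image, prototypes)

# Text-vision setting: compare the query image to embeddings of
# attribute-object prompts (hypothetical prompts and vectors).
prompt_embeddings = {
    "a red cup": [0.9, 0.0, 0.1],
    "a blue cup": [0.1, 0.0, 0.9],
}
text_vision_prediction = nearest_label(query_image, prompt_embeddings)

print(image_only_prediction, "|", text_vision_prediction)
```

In a real evaluation the toy vectors would be replaced by embeddings from each model's image encoder and, for the text-vision setting, its text encoder. Under these assumptions, a model could score well in one setting and poorly in the other, which is the kind of dissociation the abstract reports.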

representative citing papers

Benchmarking Attribute Discrimination in Infant-Scale Vision-Language Models

cs.LG · 2025-12-22 · unverdicted · novelty 6.0

Infant-scale VLMs discriminate size and texture visually but perform poorly on color and struggle to ground attributes in text, while web-scale models excel at color grounding.

