{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:CDU5EXKTMXISG6ZBLQGIKAZVCY","short_pith_number":"pith:CDU5EXKT","schema_version":"1.0","canonical_sha256":"10e9d25d5365d1237b215c0c8503351615a2c2c09fb537882eb2bb0c9bd2c997","source":{"kind":"arxiv","id":"2604.10528","version":4},"attestation_state":"computed","paper":{"title":"BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Current vision-language models lack genuine geometric comprehension and instead rely on texture and contextual shortcuts.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Aaditya Baranwal, Abhishek Rajora, Vishal Yadav","submitted_at":"2026-04-12T08:46:27Z","abstract_excerpt":"While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it yet remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce $\\textbf{BareBones}$, a zero-shot benchmark designed to stre"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":false},"canonical_record":{"source":{"id":"2604.10528","kind":"arxiv","version":4},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2026-04-12T08:46:27Z","cross_cats_sorted":[],"title_canon_sha256":"19abdaacccdd5cef35641ef8f7ad87c370cc685ea3b0b226b35c30a7d439689f","abstract_canon_sha256":"e9dfbca550e96d0b69b7394c76c6a98d6a71239f85b26a8aac0c1d270a84ba48"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-06-02T01:03:46.875936Z","signature_b64":"cpBwyl4B+ykfVhy0TW3ygnVcu6carAAPZBwZUz+JQijlfq482C1R8OnwWglwJko+qxXkPaw9Nn8hbWMRXcg5Cg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"10e9d25d5365d1237b215c0c8503351615a2c2c09fb537882eb2bb0c9bd2c997","last_reissued_at":"2026-06-02T01:03:46.875372Z","signature_status":"signed_v1","first_computed_at":"2026-06-02T01:03:46.875372Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Current vision-language models lack genuine geometric comprehension and instead rely on texture and contextual shortcuts.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Aaditya Baranwal, Abhishek Rajora, Vishal Yadav","submitted_at":"2026-04-12T08:46:27Z","abstract_excerpt":"While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it yet remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce $\\textbf{BareBones}$, a zero-shot benchmark designed to stre"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (eg. GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the Texture Bias Cliff.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the curated pixel-level silhouettes and WTP-Bench taxonomy are truly noise-free and isolate geometric structure without inadvertently leaking semantic, contextual, or annotation cues that models could exploit.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"VLMs exhibit a consistent 'Texture Bias Cliff' and fail to comprehend pure geometric shapes from boundary contours alone in zero-shot settings.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Current vision-language models lack genuine geometric comprehension and instead rely on texture and contextual shortcuts.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"2a9b9d96f3827d8daa57f4103c9a3c6b17eedb45b595ff7feaa99a8b4979a629"},"source":{"id":"2604.10528","kind":"arxiv","version":4},"verdict":{"id":"ad7ac805-ee0f-4cc5-a214-0620925eb117","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-10T16:15:29.386337Z","strongest_claim":"Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (eg. GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the Texture Bias Cliff.","one_line_summary":"VLMs exhibit a consistent 'Texture Bias Cliff' and fail to comprehend pure geometric shapes from boundary contours alone in zero-shot settings.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the curated pixel-level silhouettes and WTP-Bench taxonomy are truly noise-free and isolate geometric structure without inadvertently leaking semantic, contextual, or annotation cues that models could exploit.","pith_extraction_headline":"Current vision-language models lack genuine geometric comprehension and instead rely on texture and contextual shortcuts."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2604.10528/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2604.10528","created_at":"2026-06-02T01:03:46.875440+00:00"},{"alias_kind":"arxiv_version","alias_value":"2604.10528v4","created_at":"2026-06-02T01:03:46.875440+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2604.10528","created_at":"2026-06-02T01:03:46.875440+00:00"},{"alias_kind":"pith_short_12","alias_value":"CDU5EXKTMXIS","created_at":"2026-06-02T01:03:46.875440+00:00"},{"alias_kind":"pith_short_16","alias_value":"CDU5EXKTMXISG6ZB","created_at":"2026-06-02T01:03:46.875440+00:00"},{"alias_kind":"pith_short_8","alias_value":"CDU5EXKT","created_at":"2026-06-02T01:03:46.875440+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/CDU5EXKTMXISG6ZBLQGIKAZVCY","json":"https://pith.science/pith/CDU5EXKTMXISG6ZBLQGIKAZVCY.json","graph_json":"https://pith.science/api/pith-number/CDU5EXKTMXISG6ZBLQGIKAZVCY/graph.json","events_json":"https://pith.science/api/pith-number/CDU5EXKTMXISG6ZBLQGIKAZVCY/events.json","paper":"https://pith.science/paper/CDU5EXKT"},"agent_actions":{"view_html":"https://pith.science/pith/CDU5EXKTMXISG6ZBLQGIKAZVCY","download_json":"https://pith.science/pith/CDU5EXKTMXISG6ZBLQGIKAZVCY.json","view_paper":"https://pith.science/paper/CDU5EXKT","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2604.10528&json=true","fetch_graph":"https://pith.science/api/pith-number/CDU5EXKTMXISG6ZBLQGIKAZVCY/graph.json","fetch_events":"https://pith.science/api/pith-number/CDU5EXKTMXISG6ZBLQGIKAZVCY/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/CDU5EXKTMXISG6ZBLQGIKAZVCY/action/timestamp_anchor","attest_storage":"https://pith.science/pith/CDU5EXKTMXISG6ZBLQGIKAZVCY/action/storage_attestation","attest_author":"https://pith.science/pith/CDU5EXKTMXISG6ZBLQGIKAZVCY/action/author_attestation","sign_citation":"https://pith.science/pith/CDU5EXKTMXISG6ZBLQGIKAZVCY/action/citation_signature","submit_replication":"https://pith.science/pith/CDU5EXKTMXISG6ZBLQGIKAZVCY/action/replication_record"}},"created_at":"2026-06-02T01:03:46.875440+00:00","updated_at":"2026-06-02T01:03:46.875440+00:00"}