{"paper":{"title":"BLINK: Multimodal Large Language Models Can See but Not Perceive","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"Multimodal LLMs like GPT-4V reach only 51% accuracy on visual perception tasks that humans solve at 96%.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Bangzheng Li, Dan Roth, Haoyu Wang, Noah A. Smith, Ranjay Krishna, Wei-Chiu Ma, Xingyu Fu, Xudong Lin, Yu Feng, Yushi Hu","submitted_at":"2024-04-18T17:59:54Z","abstract_excerpt":"We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans \"within a blink\" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or mult"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not emerged yet in recent multimodal LLMs","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the selected tasks genuinely require visual perception that cannot be solved through language patterns or statistical shortcuts in the training data.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Multimodal LLMs like GPT-4V reach only 51% accuracy on visual perception tasks that humans solve at 96%.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"2917149b3d2dde9ce6f177db5480221f03644168e304eef6bdde5e78bf6798a4"},"source":{"id":"2404.12390","kind":"arxiv","version":4},"verdict":{"id":"8e751e7c-b924-40d4-95b3-a023e85b974f","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T20:14:26.616449Z","strongest_claim":"even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not emerged yet in recent multimodal LLMs","one_line_summary":"BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the selected tasks genuinely require visual perception that cannot be solved through language patterns or statistical shortcuts in the training data.","pith_extraction_headline":"Multimodal LLMs like GPT-4V reach only 51% accuracy on visual perception tasks that humans solve at 96%."},"references":{"count":90,"sample":[{"doi":"","year":2024,"title":"Introducing the next generation of claude.https://www.anthropic.com/news/ claude-3-family (March 2024) 11, 12, 23, 24","work_id":"f8bea833-ebe2-4366-aa34-0ae3433f6dc7","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"In: AAAI (2019) 10","work_id":"365c08d1-df9d-4bc5-859b-42eb662a253d","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Advances in Neural Information Processing Systems35, 23716–23736 (2022) 2, 4, 22","work_id":"88ba30a1-83ea-4142-9669-25cd7d6dffa5","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2015,"title":"In: Proceedings of the IEEE international conference on computer vision","work_id":"a8727b08-59c4-43f9-b90a-a33fa63ffb0b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models","work_id":"87bfa84a-e663-4165-806f-93ef439d88d0","ref_index":5,"cited_arxiv_id":"2308.01390","is_internal_anchor":true}],"resolved_work":90,"snapshot_sha256":"0ff96068e0e13eefed46d8bbcc5691567336e60abbd40ed9842bf8fa961171ee","internal_anchors":20},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b1aea25256854ddbeeb2d9df367d438391cb3695d8baf28238790b470a4e8fb0"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}