{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:WHD2ZAOCEHRY3P4PBHP5FF2RDY","short_pith_number":"pith:WHD2ZAOC","schema_version":"1.0","canonical_sha256":"b1c7ac81c221e38dbf8f09dfd297511e28e68d6946a16ac84740f6bd226f0367","source":{"kind":"arxiv","id":"2505.21374","version":1},"attestation_state":"computed","paper":{"title":"Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Multimodal models perceive video details but fail to integrate scattered clues, scoring at most 45 percent on a new Holmes-inspired benchmark.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Jing Liao, Junhao Cheng, Teng Wang, Ying Shan, Yixiao Ge, Yuying Ge","submitted_at":"2025-05-27T16:05:01Z","abstract_excerpt":"Recent advances in CoT reasoning and RL post-training have been reported to enhance video reasoning capabilities of MLLMs. This progress naturally raises a question: can these models perform complex video reasoning in a manner comparable to human experts? However, existing video benchmarks primarily evaluate visual perception and grounding abilities, with questions that can be answered based on explicit prompts or isolated visual cues. Such benchmarks do not fully capture the intricacies of real-world reasoning, where humans must actively search for, integrate, and analyze multiple clues befor"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2505.21374","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2025-05-27T16:05:01Z","cross_cats_sorted":[],"title_canon_sha256":"1037a1b2b279b5f0742dc6dfa56f6ffc64357cdb3e474d708d8ec7e95ff08200","abstract_canon_sha256":"7d62d4aba317088c9ae2a9712056750f44141128f5c8fcb45341f9e87195b8f1"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.952899Z","signature_b64":"DPxV2gSktf/tu7Om/tLwv/hBsvnVq/UXncyTX5Fm6H79zbcHZvLCbikHs/KxGZ16hJUoItoop6k3vd1cYtSrBQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"b1c7ac81c221e38dbf8f09dfd297511e28e68d6946a16ac84740f6bd226f0367","last_reissued_at":"2026-05-17T23:38:14.952213Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.952213Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Multimodal models perceive video details but fail to integrate scattered clues, scoring at most 45 percent on a new Holmes-inspired benchmark.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Jing Liao, Junhao Cheng, Teng Wang, Ying Shan, Yixiao Ge, Yuying Ge","submitted_at":"2025-05-27T16:05:01Z","abstract_excerpt":"Recent advances in CoT reasoning and RL post-training have been reported to enhance video reasoning capabilities of MLLMs. This progress naturally raises a question: can these models perform complex video reasoning in a manner comparable to human experts? However, existing video benchmarks primarily evaluate visual perception and grounding abilities, with questions that can be answered based on explicit prompts or isolated visual cues. Such benchmarks do not fully capture the intricacies of real-world reasoning, where humans must actively search for, integrate, and analyze multiple clues befor"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information and often miss critical clues. For example, the best-performing model, Gemini-2.5-Pro, achieves an accuracy of only 45%, with most models scoring below 40%.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that the seven manually designed tasks from suspense films accurately require and measure active search, integration, and analysis of multiple clues in a manner comparable to human expert reasoning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Multimodal models perceive video details but fail to integrate scattered clues, scoring at most 45 percent on a new Holmes-inspired benchmark.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3704675f962dacd801246b4cb3c35d05ba10827bd3a3fc69eab6d8ec07ac857a"},"source":{"id":"2505.21374","kind":"arxiv","version":1},"verdict":{"id":"ebc0c269-76f7-412b-9420-4ff5479fcbe3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T05:36:20.043648Z","strongest_claim":"Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information and often miss critical clues. For example, the best-performing model, Gemini-2.5-Pro, achieves an accuracy of only 45%, with most models scoring below 40%.","one_line_summary":"Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that the seven manually designed tasks from suspense films accurately require and measure active search, integration, and analysis of multiple clues in a manner comparable to human expert reasoning.","pith_extraction_headline":"Multimodal models perceive video details but fail to integrate scattered clues, scoring at most 45 percent on a new Holmes-inspired benchmark."},"references":{"count":51,"sample":[{"doi":"","year":2022,"title":"Chain-of-thought prompting elicits reasoning in large language models","work_id":"4160f614-809e-4eb0-8951-702539a20d52","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","ref_index":2,"cited_arxiv_id":"2402.03300","is_internal_anchor":true},{"doi":"","year":2025,"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","ref_index":3,"cited_arxiv_id":"2501.12948","is_internal_anchor":true},{"doi":"","year":2024,"title":"Introducing openai o1","work_id":"993616f2-1ea2-492f-857c-c3236709e4af","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"OpenAI. Openai o3. 2025. 2, 9","work_id":"a6e82b4b-c165-409b-b4cc-1512baf99410","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":51,"snapshot_sha256":"7ea1c5d586d2fb268f50ac2a75fdf39861e27fd04aecb63a1a3fd1cef3ba6378","internal_anchors":21},"formal_canon":{"evidence_count":3,"snapshot_sha256":"b78b4c121060f46eb3708a4ffc3c6c4462c3c98da84b660eeddf9b30b2c974b3"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2505.21374","created_at":"2026-05-17T23:38:14.952339+00:00"},{"alias_kind":"arxiv_version","alias_value":"2505.21374v1","created_at":"2026-05-17T23:38:14.952339+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2505.21374","created_at":"2026-05-17T23:38:14.952339+00:00"},{"alias_kind":"pith_short_12","alias_value":"WHD2ZAOCEHRY","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"WHD2ZAOCEHRY3P4P","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"WHD2ZAOC","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":33,"internal_anchor_count":33,"sample":[{"citing_arxiv_id":"2606.05008","citing_title":"M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2606.03087","citing_title":"Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR","ref_index":97,"is_internal_anchor":true},{"citing_arxiv_id":"2606.07643","citing_title":"AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs","ref_index":84,"is_internal_anchor":true},{"citing_arxiv_id":"2606.07639","citing_title":"MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2606.02564","citing_title":"VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2606.02642","citing_title":"SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2606.02564","citing_title":"VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2605.26014","citing_title":"STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.25621","citing_title":"StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23216","citing_title":"CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21931","citing_title":"EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14692","citing_title":"Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2511.18373","citing_title":"MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2511.19972","citing_title":"Boosting Reasoning in Large Multimodal Models via Activation Replay","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2512.03043","citing_title":"OneThinker: All-in-one Reasoning Model for Image and Video","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2512.16918","citing_title":"AdaTooler-V: Adaptive Tool-Use for Images and Videos","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2603.20633","citing_title":"Seed1.8 Model Card: Towards Generalized Real-World Agency","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12034","citing_title":"Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12034","citing_title":"Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10966","citing_title":"MMTB: Evaluating Terminal Agents on Multimedia-File Tasks","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27083","citing_title":"Co-Evolving Policy Distillation","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27393","citing_title":"MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09874","citing_title":"EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding","ref_index":120,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03276","citing_title":"VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03276","citing_title":"VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing","ref_index":3,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/WHD2ZAOCEHRY3P4PBHP5FF2RDY","json":"https://pith.science/pith/WHD2ZAOCEHRY3P4PBHP5FF2RDY.json","graph_json":"https://pith.science/api/pith-number/WHD2ZAOCEHRY3P4PBHP5FF2RDY/graph.json","events_json":"https://pith.science/api/pith-number/WHD2ZAOCEHRY3P4PBHP5FF2RDY/events.json","paper":"https://pith.science/paper/WHD2ZAOC"},"agent_actions":{"view_html":"https://pith.science/pith/WHD2ZAOCEHRY3P4PBHP5FF2RDY","download_json":"https://pith.science/pith/WHD2ZAOCEHRY3P4PBHP5FF2RDY.json","view_paper":"https://pith.science/paper/WHD2ZAOC","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2505.21374&json=true","fetch_graph":"https://pith.science/api/pith-number/WHD2ZAOCEHRY3P4PBHP5FF2RDY/graph.json","fetch_events":"https://pith.science/api/pith-number/WHD2ZAOCEHRY3P4PBHP5FF2RDY/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/WHD2ZAOCEHRY3P4PBHP5FF2RDY/action/timestamp_anchor","attest_storage":"https://pith.science/pith/WHD2ZAOCEHRY3P4PBHP5FF2RDY/action/storage_attestation","attest_author":"https://pith.science/pith/WHD2ZAOCEHRY3P4PBHP5FF2RDY/action/author_attestation","sign_citation":"https://pith.science/pith/WHD2ZAOCEHRY3P4PBHP5FF2RDY/action/citation_signature","submit_replication":"https://pith.science/pith/WHD2ZAOCEHRY3P4PBHP5FF2RDY/action/replication_record"}},"created_at":"2026-05-17T23:38:14.952339+00:00","updated_at":"2026-05-17T23:38:14.952339+00:00"}