Pith Number

pith:WHD2ZAOC

pith:2025:WHD2ZAOCEHRY3P4PBHP5FF2RDY
not attested · not anchored · not stored · refs resolved

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Jing Liao, Junhao Cheng, Teng Wang, Ying Shan, Yixiao Ge, Yuying Ge

Multimodal models perceive video details but fail to integrate scattered clues, scoring at most 45 percent on a new Holmes-inspired benchmark.

arxiv:2505.21374 v1 · 2025-05-27 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{WHD2ZAOCEHRY3P4PBHP5FF2RDY}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information and often miss critical clues. For example, the best-performing model, Gemini-2.5-Pro, achieves an accuracy of only 45%, with most models scoring below 40%.

C2 weakest assumption

The assumption that the seven manually designed tasks from suspense films accurately require and measure active search, integration, and analysis of multiple clues in a manner comparable to human expert reasoning.

C3 one-line summary

The Video-Holmes benchmark shows that top MLLMs reach at most 45% accuracy on tasks that require integrating multiple clues from suspense films, in contrast to existing benchmarks that focus on perception.

References

51 extracted · 51 resolved · 21 Pith anchors

[1] Chain-of-thought prompting elicits reasoning in large language models 2022
[2] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models 2024 · arXiv:2402.03300
[3] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning 2025 · arXiv:2501.12948
[4] Introducing OpenAI o1 2024
[5] OpenAI o3 2025

Formal links

3 machine-checked theorem links

Cited by

32 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:14.952213Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

b1c7ac81c221e38dbf8f09dfd297511e28e68d6946a16ac84740f6bd226f0367

Aliases

arxiv: 2505.21374 · arxiv_version: 2505.21374v1 · doi: 10.48550/arxiv.2505.21374 · pith_short_12: WHD2ZAOCEHRY · pith_short_16: WHD2ZAOCEHRY3P4P · pith_short_8: WHD2ZAOC
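The short aliases listed above are plain prefixes of the full 26-character Pith Number. In this record, the full number also matches the unpadded Base32 encoding of the first 16 bytes of the canonical hash. The Python sketch below checks both observations. The function names are hypothetical, and the Base32 relationship is inferred from this one record rather than taken from a Pith specification, so treat it as an assumption.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "b1c7ac81c221e38dbf8f09dfd297511e28e68d6946a16ac84740f6bd226f0367"
PITH_NUMBER = "WHD2ZAOCEHRY3P4PBHP5FF2RDY"


def pith_from_hash(hex_digest: str, n_bytes: int = 16) -> str:
    """Base32-encode the first n_bytes of a hex digest and drop '=' padding.

    Assumption: this mapping is inferred from the record above,
    not from documented Pith behavior.
    """
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as prefixes of the full number."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


if __name__ == "__main__":
    print(pith_from_hash(CANONICAL_HASH))  # WHD2ZAOCEHRY3P4PBHP5FF2RDY
    for name, value in short_aliases(PITH_NUMBER).items():
        print(name, value)
```

If the relationship holds for other records, a mirror could rederive the identifier from the canonical hash alone. That is untested here beyond this single record.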
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/WHD2ZAOCEHRY3P4PBHP5FF2RDY \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b1c7ac81c221e38dbf8f09dfd297511e28e68d6946a16ac84740f6bd226f0367
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "7d62d4aba317088c9ae2a9712056750f44141128f5c8fcb45341f9e87195b8f1",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-05-27T16:05:01Z",
    "title_canon_sha256": "1037a1b2b279b5f0742dc6dfa56f6ffc64357cdb3e474d708d8ec7e95ff08200"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2505.21374",
    "kind": "arxiv",
    "version": 1
  }
}
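The verification command above canonicalizes the record by serializing it with sorted keys and compact separators before hashing. A minimal offline sketch of that step in Python follows, using the record shown on this page. The function name `canonical_sha256` is hypothetical, and the sketch assumes the JSON printed here is exactly what the API returns under `canonical_record`.

```python
import hashlib
import json

# Canonical record as printed on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "7d62d4aba317088c9ae2a9712056750f44141128f5c8fcb45341f9e87195b8f1",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2025-05-27T16:05:01Z",
        "title_canon_sha256": "1037a1b2b279b5f0742dc6dfa56f6ffc64357cdb3e474d708d8ec7e95ff08200",
    },
    "schema_version": "1.0",
    "source": {"id": "2505.21374", "kind": "arxiv", "version": 1},
}


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no extra whitespace, then SHA-256 the UTF-8 bytes.

    Mirrors the json.dumps arguments used in the shell one-liner.
    """
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Compare the printed digest with the canonical hash listed on this page.
    print(canonical_sha256(RECORD))
```

Because the keys are sorted before hashing, the digest does not depend on the order in which a mirror stores the fields, which is what lets independent copies agree on the same value.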