Pith Number

pith:RDI4RZGZ

pith:2025:RDI4RZGZ2IS7TIMSM7WPIBC4BX

not attested · not anchored · not stored · refs resolved

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Changlian Ma, Chenting Wang, Haian Huang, Jianfei Gao, Jiashuo Yu, Kai Chen, Limin Wang, Min Dou, Wenhai Wang, Xiangyu Zeng, Xinhao Li, Yali Wang, Yinan He, Yi Wang, Yu Qiao, Ziang Yan

Long and rich context modeling lets video MLLMs process at least six times longer inputs while gaining object tracking and segmentation skills.

arxiv:2501.12386 v3 · 2025-01-21 · cs.CV


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{RDI4RZGZ2IS7TIMSM7WPIBC4BX}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Experimental results demonstrate this unique design of LRC greatly improves the results of video MLLM in mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and master specialized vision capabilities like object tracking and segmentation.

C2 · weakest assumption

The reported gains in context length, benchmark scores, and specialized vision tasks are attributable to the long and rich context modeling components rather than differences in training data volume, model scale, or benchmark selection.

C3 · one-line summary

InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding better benchmark performance, 6x longer video memory, and new capabilities like object tracking and segmentation.

References

37 extracted · 37 resolved · 16 Pith anchors

[1] Cosmos World Foundation Model Platform for Physical AI · arXiv:2501.03575

[2] Qwen Technical Report · arXiv:2309.16609

[3] One token to seg them all: Language instructed reasoning segmentation in videos

[4] Token Merging: Your ViT But Faster · arXiv:2210.09461

[5] InternLM2 Technical Report · arXiv:2403.17297

Formal links

2 machine-checked theorem links

Cited by

28 papers in Pith

VISD: Enhancing Video Reasoning via Structured Self-Distillation

MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

OProver: A Unified Framework for Agentic Formal Theorem Proving

TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?

Receipt and verification

First computed	2026-05-17T23:38:15.344963Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

88d1c8e4d9d225f9a19267ecf4045c0ddf7862abce6668f060d7fca71f012c87

Aliases

arxiv: 2501.12386 · arxiv_version: 2501.12386v3 · doi: 10.48550/arxiv.2501.12386 · pith_short_12: RDI4RZGZ2IS7 · pith_short_16: RDI4RZGZ2IS7TIMS · pith_short_8: RDI4RZGZ
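
The values on this page suggest how the identifier and its short aliases relate to the canonical hash: the 26-character Pith Number matches the first 26 characters of the Base32 encoding of the SHA-256 digest, and each `pith_short_N` alias is the first N characters of that number. This is an inference from the values shown here, not a documented rule, and the helper name `pith_from_hash` is hypothetical. A minimal sketch under that assumption:

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "88d1c8e4d9d225f9a19267ecf4045c0ddf7862abce6668f060d7fca71f012c87"
PITH_NUMBER = "RDI4RZGZ2IS7TIMSM7WPIBC4BX"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this mapping is inferred from the values on this page,
    not taken from a published specification.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = pith_from_hash(CANONICAL_HASH)

# Short aliases as plain prefixes of the derived identifier.
aliases = {f"pith_short_{n}": derived[:n] for n in (8, 12, 16)}

print(derived == PITH_NUMBER)
print(aliases)
```

If the assumption holds for other records, a client could check that an identifier and its hash belong together without contacting the resolver.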

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/RDI4RZGZ2IS7TIMSM7WPIBC4BX \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 88d1c8e4d9d225f9a19267ecf4045c0ddf7862abce6668f060d7fca71f012c87

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "8df7b623dddd4a4075d05f7a8df1784604860d9c0984f680b8368e1ffc14d47a",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-01-21T18:59:00Z",
    "title_canon_sha256": "dde1b16d0f441e24c50b224a891f44b330b55b20a8347dab0577c123db7179f2"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.12386",
    "kind": "arxiv",
    "version": 3
  }
}
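
The same canonicalization used in the verification one-liner can be applied offline to the record above. The sketch below mirrors that one-liner's settings (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes) and prints a SHA-256 digest to compare by hand with the canonical hash. It assumes the record shown on this page is byte-for-byte what the resolver returns under `canonical_record`, which this page implies but does not state. The function names are hypothetical.

```python
import hashlib
import json

# Canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "8df7b623dddd4a4075d05f7a8df1784604860d9c0984f680b8368e1ffc14d47a",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2025-01-21T18:59:00Z",
        "title_canon_sha256": "dde1b16d0f441e24c50b224a891f44b330b55b20a8347dab0577c123db7179f2",
    },
    "schema_version": "1.0",
    "source": {"id": "2501.12386", "kind": "arxiv", "version": 3},
}


def canonical_bytes(obj) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace, as in the one-liner."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(obj) -> str:
    """Hex SHA-256 of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()


digest = canonical_sha256(record)
print(digest)
```

Because keys are sorted before hashing, the digest does not depend on the key order in which the record happens to be written or parsed.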