pith:RDI4RZGZ
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Long and rich context modeling lets video MLLMs handle video inputs at least six times longer than the original model could, while also gaining object tracking and segmentation skills.
arxiv:2501.12386 v3 · 2025-01-21 · cs.CV
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{RDI4RZGZ2IS7TIMSM7WPIBC4BX}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Experimental results demonstrate that this unique design of long and rich context (LRC) modeling greatly improves video MLLM results on mainstream short- and long-video understanding benchmarks, enables the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and lets it master specialized vision capabilities such as object tracking and segmentation.
The reported gains in context length, benchmark scores, and specialized vision tasks are attributable to the long and rich context modeling components rather than differences in training data volume, model scale, or benchmark selection.
InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding better benchmark performance, 6x longer video memory, and new capabilities like object tracking and segmentation.
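The summary names adaptive hierarchical token compression without spelling it out. Purely as an illustration of the general idea, and explicitly not InternVideo2.5's actual compression module, the sketch below shows a generic two-level scheme: consecutive frames whose mean embeddings are nearly parallel (cosine similarity at or above a hypothetical `threshold`) are averaged into one frame, and the tokens of each surviving frame are then pooled pairwise. The token layout, the threshold value and all function names are assumptions.

```python
import math
from typing import List

Token = List[float]   # one visual token embedding (assumed layout)
Frame = List[Token]   # all tokens of one frame; every frame has the same token count


def cosine(a: Token, b: Token) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors) -> Token:
    # Element-wise average of equally sized vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def merge_redundant_frames(video: List[Frame], threshold: float) -> List[Frame]:
    """Level 1 (temporal, adaptive): open a new segment only when a frame's mean
    embedding diverges from the previous frame's, then average tokens per segment."""
    segments: List[List[Frame]] = [[video[0]]]
    for frame in video[1:]:
        if cosine(mean(segments[-1][-1]), mean(frame)) >= threshold:
            segments[-1].append(frame)
        else:
            segments.append([frame])
    # zip(*segment) groups the i-th token of every frame in the segment.
    return [[mean(group) for group in zip(*segment)] for segment in segments]


def pool_tokens(frame: Frame) -> Frame:
    """Level 2 (spatial): average neighbouring token pairs, roughly halving the count."""
    return [mean(frame[i:i + 2]) for i in range(0, len(frame), 2)]


def compress(video: List[Frame], threshold: float = 0.95) -> List[Frame]:
    if not video:
        return []
    return [pool_tokens(f) for f in merge_redundant_frames(video, threshold)]


# Toy video: two static "shots" of 4 two-dimensional tokens each, 16 tokens in total.
a: Frame = [[1.0, 0.0]] * 4
b: Frame = [[0.0, 1.0]] * 4
out = compress([a, a, b, b])
print(sum(len(f) for f in out))  # prints 4
```

The toy numbers only show the mechanism (redundant frames collapse, then each frame's token budget shrinks); they say nothing about how the paper reaches its 6x figure.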
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:38:15.344963Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
88d1c8e4d9d225f9a19267ecf4045c0ddf7862abce6668f060d7fca71f012c87
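For this record, the 26-character Pith Number coincides with the first 26 characters of the RFC 4648 base32 encoding of the canonical hash (that is, its first 130 bits), and the short form `RDI4RZGZ` is the first 8 of those characters. The sketch below checks that relationship with the Python standard library. Treating it as the general rule for every Pith Number is an assumption verified only against this one record, and `pith_number_from_hash` is a hypothetical helper name.

```python
import base64

# Hex digest copied from the "Canonical hash" field of this record.
CANONICAL_HASH = "88d1c8e4d9d225f9a19267ecf4045c0ddf7862abce6668f060d7fca71f012c87"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Hypothetical helper: base32-encode the raw digest bytes (alphabet A-Z, 2-7)
    and keep the first `length` characters. ASSUMPTION: mirrors Pith's derivation."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


full = pith_number_from_hash(CANONICAL_HASH)
short = full[:8]
print(full, short)  # RDI4RZGZ2IS7TIMSM7WPIBC4BX RDI4RZGZ
```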
Verify this Pith Number yourself
```shell
curl -sH 'Accept: application/ld+json' https://pith.science/pith/RDI4RZGZ2IS7TIMSM7WPIBC4BX \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 88d1c8e4d9d225f9a19267ecf4045c0ddf7862abce6668f060d7fca71f012c87
```
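The same check can be run without `jq`. A minimal Python sketch, assuming the endpoint returns a JSON-LD object with a top-level `canonical_record` member (as the `jq` filter implies); `canonical_sha256` and `fetch_canonical_record` are hypothetical helper names. The canonicalization mirrors the one-liner: sorted keys, no insignificant whitespace, and UTF-8 output without ASCII escaping.

```python
import hashlib
import json
import urllib.request


def canonical_sha256(record) -> str:
    """Serialize exactly as the shell one-liner does, then hash the UTF-8 bytes."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def fetch_canonical_record(pith_number: str) -> dict:
    """Fetch the JSON-LD document and return its `canonical_record` member.
    Requires network access; the response shape is assumed from the jq filter."""
    url = f"https://pith.science/pith/{pith_number}"
    req = urllib.request.Request(url, headers={"Accept": "application/ld+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["canonical_record"]
```

With network access, `canonical_sha256(fetch_canonical_record("RDI4RZGZ2IS7TIMSM7WPIBC4BX"))` should return the canonical hash listed above, provided the endpoint behaves as the shell pipeline expects. Because keys are sorted before hashing, the digest does not depend on the key order the server happens to emit.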
Canonical record JSON
```json
{
  "metadata": {
    "abstract_canon_sha256": "8df7b623dddd4a4075d05f7a8df1784604860d9c0984f680b8368e1ffc14d47a",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-01-21T18:59:00Z",
    "title_canon_sha256": "dde1b16d0f441e24c50b224a891f44b330b55b20a8347dab0577c123db7179f2"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.12386",
    "kind": "arxiv",
    "version": 3
  }
}
```
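The canonical record stores only hashes and a pointer to the paper, so locating the paper means resolving the `source` member. A minimal sketch, assuming that records with `"kind": "arxiv"` map onto arXiv's versioned abstract URL pattern `https://arxiv.org/abs/<id>v<version>`; `arxiv_abs_url` and `hash_fields_look_valid` are hypothetical helper names, and the second one only checks the shape of the digests, not how the title and abstract were canonicalized.

```python
import json

# Canonical record copied from above.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "8df7b623dddd4a4075d05f7a8df1784604860d9c0984f680b8368e1ffc14d47a",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-01-21T18:59:00Z",
    "title_canon_sha256": "dde1b16d0f441e24c50b224a891f44b330b55b20a8347dab0577c123db7179f2"
  },
  "schema_version": "1.0",
  "source": {"id": "2501.12386", "kind": "arxiv", "version": 3}
}
"""


def arxiv_abs_url(record: dict) -> str:
    """Hypothetical helper: turn the record's `source` pointer into a versioned arXiv URL."""
    src = record["source"]
    if src["kind"] != "arxiv":
        raise ValueError(f"unsupported source kind: {src['kind']!r}")
    return f"https://arxiv.org/abs/{src['id']}v{src['version']}"


def hash_fields_look_valid(record: dict) -> bool:
    """Check that every *_sha256 field in `metadata` is a 64-character lowercase hex digest."""
    digests = [v for k, v in record["metadata"].items() if k.endswith("_sha256")]
    return bool(digests) and all(
        len(d) == 64 and all(c in "0123456789abcdef" for c in d) for d in digests
    )


record = json.loads(RECORD_JSON)
print(arxiv_abs_url(record))           # https://arxiv.org/abs/2501.12386v3
print(hash_fields_look_valid(record))  # True
```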