{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:EXNEXHS34HBJHXDIZFWV4V4ATB","short_pith_number":"pith:EXNEXHS3","schema_version":"1.0","canonical_sha256":"25da4b9e5be1c293dc68c96d5e5780985e68ac4cf4cd275df3a443c98744cefc","source":{"kind":"arxiv","id":"2410.17434","version":1},"attestation_state":"computed","paper":{"title":"LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding","license":"http://creativecommons.org/licenses/by/4.0/","headline":"LongVU adaptively compresses long videos by removing redundant frames and tokens to fit hour-long clips into limited LLM context.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Balakrishnan Varadarajan, Bilge Soran, Changsheng Zhao, Chenchen Zhu, Fanyi Xiao, Florian Bordes, Hu Xu, Hyunwoo J. Kim, Jun Chen, Lemeng Wu, Mohamed Elhoseiny, Raghuraman Krishnamoorthi, Vikas Chandra, Xiaoqian Shen, Yunyang Xiong, Zechun Liu, Zhuang Liu","submitted_at":"2024-10-22T21:21:37Z","abstract_excerpt":"Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism thats reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redun"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2410.17434","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2024-10-22T21:21:37Z","cross_cats_sorted":[],"title_canon_sha256":"1b6ac9fd9476f5260c2a24fde0b0a0761b95e10915c781dc05fadc0f7ab7e229","abstract_canon_sha256":"2c655dcf5b26292ac4b16b56aefe6dbd68a6c412c51af52fcb02b16e3e68c63d"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.688622Z","signature_b64":"nrXGYuHgVkX7VNtbVsUAqC99U+otrU1ckwGrytM/C6dpXDplQkNQpiTNumrt7w/ffMz/UhigNthgOFDokv9WAQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"25da4b9e5be1c293dc68c96d5e5780985e68ac4cf4cd275df3a443c98744cefc","last_reissued_at":"2026-05-17T23:38:47.688103Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.688103Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding","license":"http://creativecommons.org/licenses/by/4.0/","headline":"LongVU adaptively compresses long videos by removing redundant frames and tokens to fit hour-long clips into limited LLM context.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Balakrishnan Varadarajan, Bilge Soran, Changsheng Zhao, Chenchen Zhu, Fanyi Xiao, Florian Bordes, Hu Xu, Hyunwoo J. Kim, Jun Chen, Lemeng Wu, Mohamed Elhoseiny, Raghuraman Krishnamoorthi, Vikas Chandra, Xiaoqian Shen, Yunyang Xiong, Zechun Liu, Zhuang Liu","submitted_at":"2024-10-22T21:21:37Z","abstract_excerpt":"Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism thats reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redun"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our LongVU consistently surpass existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that DINOv2 similarity reliably identifies redundant frames without discarding task-relevant visual information and that text-guided cross-modal queries plus temporal dependency reduction preserve all necessary details for downstream understanding.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LongVU adaptively compresses long videos by removing redundant frames and tokens to fit hour-long clips into limited LLM context.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c51cbd96859f5bbe81add0bfe675c69ffcfdc4971d503858cb493ede14a7dc69"},"source":{"id":"2410.17434","kind":"arxiv","version":1},"verdict":{"id":"60cef8cc-c0ca-4d12-bce1-aae87594e1e2","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T13:49:27.580289Z","strongest_claim":"Our LongVU consistently surpass existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.","one_line_summary":"LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that DINOv2 similarity reliably identifies redundant frames without discarding task-relevant visual information and that text-guided cross-modal queries plus temporal dependency reduction preserve all necessary details for downstream understanding.","pith_extraction_headline":"LongVU adaptively compresses long videos by removing redundant frames and tokens to fit hour-long clips into limited LLM context."},"references":{"count":35,"sample":[{"doi":"","year":null,"title":"Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone","work_id":"feef9556-a016-493c-abd2-0c97a23a7ebf","ref_index":1,"cited_arxiv_id":"2404.14219","is_internal_anchor":true},{"doi":"","year":null,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":2,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":null,"title":"Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens","work_id":"cc937528-86d1-430f-bb5d-4980dbaadd72","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Token Merging: Your ViT But Faster","work_id":"528509bc-2611-4e7f-a772-ea14d25b6dae","ref_index":4,"cited_arxiv_id":"2210.09461","is_internal_anchor":true},{"doi":"","year":2005,"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","ref_index":5,"cited_arxiv_id":"2005.14165","is_internal_anchor":true}],"resolved_work":35,"snapshot_sha256":"894c6c4e8b922ab6362c19ac20437904ea9d062a18a2368a8d607314411962f1","internal_anchors":25},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c9d7cc26ff704e4bbf66c68b949bf9454bb467a5b19c2f37b2fc6203f2d0419a"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2410.17434","created_at":"2026-05-17T23:38:47.688187+00:00"},{"alias_kind":"arxiv_version","alias_value":"2410.17434v1","created_at":"2026-05-17T23:38:47.688187+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2410.17434","created_at":"2026-05-17T23:38:47.688187+00:00"},{"alias_kind":"pith_short_12","alias_value":"EXNEXHS34HBJ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"EXNEXHS34HBJHXDI","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"EXNEXHS3","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":35,"internal_anchor_count":35,"sample":[{"citing_arxiv_id":"2412.04468","citing_title":"NVILA: Efficient Frontier Visual Language Models","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21988","citing_title":"Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22269","citing_title":"MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22678","citing_title":"Swift Sampling: Selecting Temporal Surprises via Taylor Series","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17260","citing_title":"LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17283","citing_title":"OProver: A Unified Framework for Agentic Formal Theorem Proving","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18018","citing_title":"See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19218","citing_title":"Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19506","citing_title":"EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23617","citing_title":"One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2501.00574","citing_title":"VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2511.13026","citing_title":"REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2501.12386","citing_title":"InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2512.08410","citing_title":"Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2512.21334","citing_title":"Streaming Video Instruction Tuning","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2601.14724","citing_title":"HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2601.10611","citing_title":"Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding","ref_index":129,"is_internal_anchor":true},{"citing_arxiv_id":"2602.20913","citing_title":"LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2603.01400","citing_title":"Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13831","citing_title":"Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12056","citing_title":"OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09223","citing_title":"CREST: Curvature-Regulated Event-Centric Sampling for Efficient Long-Video Understanding","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07575","citing_title":"Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05848","citing_title":"VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19564","citing_title":"EgoSelf: From Memory to Personalized Egocentric Assistant","ref_index":45,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/EXNEXHS34HBJHXDIZFWV4V4ATB","json":"https://pith.science/pith/EXNEXHS34HBJHXDIZFWV4V4ATB.json","graph_json":"https://pith.science/api/pith-number/EXNEXHS34HBJHXDIZFWV4V4ATB/graph.json","events_json":"https://pith.science/api/pith-number/EXNEXHS34HBJHXDIZFWV4V4ATB/events.json","paper":"https://pith.science/paper/EXNEXHS3"},"agent_actions":{"view_html":"https://pith.science/pith/EXNEXHS34HBJHXDIZFWV4V4ATB","download_json":"https://pith.science/pith/EXNEXHS34HBJHXDIZFWV4V4ATB.json","view_paper":"https://pith.science/paper/EXNEXHS3","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2410.17434&json=true","fetch_graph":"https://pith.science/api/pith-number/EXNEXHS34HBJHXDIZFWV4V4ATB/graph.json","fetch_events":"https://pith.science/api/pith-number/EXNEXHS34HBJHXDIZFWV4V4ATB/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/EXNEXHS34HBJHXDIZFWV4V4ATB/action/timestamp_anchor","attest_storage":"https://pith.science/pith/EXNEXHS34HBJHXDIZFWV4V4ATB/action/storage_attestation","attest_author":"https://pith.science/pith/EXNEXHS34HBJHXDIZFWV4V4ATB/action/author_attestation","sign_citation":"https://pith.science/pith/EXNEXHS34HBJHXDIZFWV4V4ATB/action/citation_signature","submit_replication":"https://pith.science/pith/EXNEXHS34HBJHXDIZFWV4V4ATB/action/replication_record"}},"created_at":"2026-05-17T23:38:47.688187+00:00","updated_at":"2026-05-17T23:38:47.688187+00:00"}