Recognition: no theorem link
Long Context Transfer from Language to Vision
Pith reviewed 2026-05-12 07:00 UTC · model grok-4.3
The pith
Extending language model context length transfers directly to vision, letting multimodal models handle orders of magnitude more visual tokens without video training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer. The resulting Long Video Assistant processes 2000 frames or over 200K visual tokens and reaches state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames.
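As a back-of-envelope check on those figures, the sketch below assumes roughly 144 visual tokens per frame after pooling; that per-frame budget is an assumption used for illustration here, not a number quoted on this page.

```python
# Back-of-envelope check of the "2000 frames or over 200K visual tokens" figure.
# TOKENS_PER_FRAME is an assumed value (pooled vision-encoder patch tokens),
# not taken from this review.
TOKENS_PER_FRAME = 144
FRAMES = 2000

visual_tokens = TOKENS_PER_FRAME * FRAMES
print(visual_tokens)  # 288000, i.e. comfortably "over 200K visual tokens"
```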
What carries the argument
Long context transfer: the direct application of language-model context extrapolation to sequences of visual tokens inside the shared transformer.
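A minimal sketch of what that application looks like at inference time; every name here (vision_encoder, projector, lm) is a hypothetical placeholder rather than the paper's API, and the flow is an illustration of the claimed mechanism, not the authors' implementation.

```python
import torch

def answer_about_video(frames, question_ids, vision_encoder, projector, lm):
    """Illustrative sketch of long context transfer at inference time.

    The only long-context machinery lives inside `lm`, whose context window
    was extended on text alone; the visual tokens simply ride on it.
    """
    vis_feats = vision_encoder(frames)                     # (n_frames, tok_per_frame, d_vis)
    vis_embeds = projector(vis_feats).flatten(0, 1)        # (n_frames * tok_per_frame, d_model)

    txt_embeds = lm.get_input_embeddings()(question_ids)   # (n_text, d_model)
    inputs = torch.cat([vis_embeds, txt_embeds], dim=0).unsqueeze(0)

    # The concatenated sequence may run to ~200K tokens; no video-specific
    # training is assumed, only the extrapolated context length of the backbone.
    return lm.generate(inputs_embeds=inputs, max_new_tokens=64)
```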
Load-bearing premise
Long-context capabilities learned purely from text sequences transfer effectively and without significant degradation to sequences of visual tokens in the same transformer architecture.
What would settle it
If a model whose context length has been extrapolated shows no improvement over a standard-context model when both are tested on long visual sequences in V-NIAH, the transfer claim is falsified.
Original abstract
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that long-context capabilities acquired by the language model backbone of LMMs transfer to the vision modality, allowing models to process orders of magnitude more visual tokens (e.g., 200K tokens from 2000 video frames) without any video training. This 'long context transfer' is demonstrated via context-length extrapolation on the LM (e.g., RoPE scaling), supported by ablations, the new synthetic V-NIAH benchmark for long visual retrieval, and SOTA results on Video-MME among 7B-scale models using their LongVA model that densely samples frames.
Significance. If the transfer holds after proper controls, the result is significant: it offers a simple, training-free way to scale LMMs to long videos by reusing text long-context techniques, avoiding expensive video data collection and multimodal long-context pretraining. The V-NIAH benchmark provides a controlled, synthetic testbed for vision long-context generalization, and open-sourcing the model and code strengthens the contribution for the community.
major comments (3)
- [§3, §4.1] §3 (Method) and §4.1 (V-NIAH): The central transfer claim requires that text-derived positional extrapolation generalizes to visual token sequences, yet no analysis isolates whether observed long-context behavior on V-NIAH stems from the LM backbone's text pretraining versus incidental effects of the multimodal pretraining on the shared transformer; a control experiment using a short-context LM backbone fine-tuned only on short multimodal data is needed to establish causality.
- [§5.2, Table 2] §5.2 (Ablations) and Table 2: The reported gains on Video-MME from increasing frame count rely on dense sampling enabled by extended context, but without reporting the performance curve versus context length or error analysis (e.g., attention collapse or information loss) for sequences beyond the original training length, it is unclear whether degradation occurs or if the transfer is lossless as claimed.
- [§4.1] §4.1 (V-NIAH construction): The synthetic benchmark inserts 'needles' into visual 'haystacks,' but the paper does not specify how the visual token embeddings for needles and haystacks are sampled from the vision encoder's output distribution; without matching real video statistics or providing a text-to-vision token distribution comparison, the benchmark may overestimate transferability.
minor comments (3)
- [Figure 1] Figure 1: The diagram of token flow could clarify the exact point at which context extrapolation is applied (before or after vision encoder projection) and include a quantitative comparison of token counts.
- [Abstract, §1] Abstract and §1: The phrase 'orders of magnitude more visual tokens' should be quantified with the precise scaling factor achieved relative to the base model's context length.
- [Related Work] Related work section: A few recent long-video LMM papers using resamplers or memory mechanisms are not cited; adding them would better position the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and indicate the revisions we will make to improve the paper.
Point-by-point responses
-
Referee: [§3, §4.1] §3 (Method) and §4.1 (V-NIAH): The central transfer claim requires that text-derived positional extrapolation generalizes to visual token sequences, yet no analysis isolates whether observed long-context behavior on V-NIAH stems from the LM backbone's text pretraining versus incidental effects of the multimodal pretraining on the shared transformer; a control experiment using a short-context LM backbone fine-tuned only on short multimodal data is needed to establish causality.
Authors: We agree that a control experiment with a short-context LM backbone would help isolate the contribution of text pretraining to the observed transfer. Our current setup uses an LM backbone pretrained on long text contexts, with subsequent multimodal training on shorter sequences. The scaling behaviors in our ablations support that the long-context capabilities originate from the LM component. However, conducting the suggested control would require substantial additional compute for retraining. We will add a clarification in the method section and a discussion in the limitations regarding this aspect. revision: partial
-
Referee: [§5.2, Table 2] §5.2 (Ablations) and Table 2: The reported gains on Video-MME from increasing frame count rely on dense sampling enabled by extended context, but without reporting the performance curve versus context length or error analysis (e.g., attention collapse or information loss) for sequences beyond the original training length, it is unclear whether degradation occurs or if the transfer is lossless as claimed.
Authors: We have conducted experiments varying the context length and will include a performance curve versus context length in the revised version of Table 2 or as a new figure in Section 5.2. For error analysis, our results indicate stable performance without notable degradation or attention collapse up to the tested lengths; we will incorporate a brief analysis of attention maps and potential information loss in the ablations section. revision: yes
-
Referee: [§4.1] §4.1 (V-NIAH construction): The synthetic benchmark inserts 'needles' into visual 'haystacks,' but the paper does not specify how the visual token embeddings for needles and haystacks are sampled from the vision encoder's output distribution; without matching real video statistics or providing a text-to-vision token distribution comparison, the benchmark may overestimate transferability.
Authors: We will revise Section 4.1 to specify the construction details: the haystack frames are sampled from a diverse set of videos, and needle frames are specific target images processed through the same vision encoder. The token embeddings are the direct outputs of the vision encoder without additional sampling modifications. We acknowledge the lack of explicit distribution matching and will add a comparison note or discussion on how the synthetic setup relates to real video token statistics to address potential overestimation concerns. revision: yes
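To make the benchmark's construction concrete, the sketch below assembles one V-NIAH-style sample: a needle frame placed at a controlled depth inside a haystack of distractor frames, paired with a question answerable only from the needle. The helper names and sampling choices are illustrative assumptions, not the authors' released code.

```python
import random

def build_vniah_sample(haystack_frames, needle_frame, needle_question,
                       total_frames, depth_fraction):
    """Assemble one synthetic long-vision retrieval sample (illustrative sketch).

    haystack_frames: pool of distractor frames drawn from ordinary videos
    needle_frame:    a single target image the question can only be answered from
    depth_fraction:  where the needle is placed (0.0 = start, 1.0 = end)
    """
    frames = random.choices(haystack_frames, k=total_frames - 1)
    insert_at = int(depth_fraction * (total_frames - 1))
    frames.insert(insert_at, needle_frame)
    return {"frames": frames, "question": needle_question, "needle_index": insert_at}

# Sweeping sequence length and needle depth yields the usual NIAH-style heatmap:
# samples = [build_vniah_sample(haystack, needle, q, n, d)
#            for n in (200, 500, 1000, 2000) for d in (0.0, 0.25, 0.5, 0.75, 1.0)]
```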
Circularity Check
No circularity: empirical extrapolation validated on benchmarks
Full rationale
The paper's core claim—that extrapolating the language backbone's context length enables LMMs to handle far more visual tokens without video-specific training—is presented as an observed empirical phenomenon, not a mathematical derivation. It is supported by direct experiments on the introduced V-NIAH synthetic benchmark and Video-MME, with ablations showing behavior under extended contexts. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain. The 'long context transfer' label is merely descriptive naming of the observed transfer effect, and the work remains self-contained against external benchmarks rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Transformer attention mechanisms can extrapolate to context lengths beyond those seen during training when positional encodings allow it.
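One common way this assumption is operationalized is positional interpolation over rotary embeddings, in the spirit of the positional-interpolation and NTK-aware scaling works cited below; the sketch shows the linear variant, with the scale factor chosen for illustration rather than taken from the paper.

```python
import torch

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary-embedding angles with optional linear positional interpolation.

    scale > 1 compresses positions so a context `scale` times longer than the
    training length reuses the angle range the model saw during training.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    pos = positions.float() / scale            # linear positional interpolation
    return torch.outer(pos, inv_freq)          # (seq_len, dim // 2)

# Trained at 4K, run at 32K: scale = 32768 / 4096 = 8 keeps angles in-distribution.
angles = rope_angles(torch.arange(32768), dim=128, scale=8.0)
```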
Forward citations
Cited by 37 Pith papers
-
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
-
MedHorizon: Towards Long-context Medical Video Understanding in the Wild
MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
-
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
-
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
-
MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
MMVIAD is the first multi-view continuous video dataset for industrial anomaly detection with four supported tasks, and the VISTA model improves average benchmark scores from 45.0 to 57.5 on unseen data while surpassi...
-
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
-
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
-
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
-
MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
MarkIt uses a query-to-mask bridge with open-vocabulary segmentation to add visual markers and frame indices to videos, enabling Vid-LLMs to achieve state-of-the-art temporal grounding on moment retrieval and highligh...
-
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
-
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
-
Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging
MERIT restores temporal reasoning in VLMs via layer-selective self-attention merging guided by a TR-improving objective that penalizes TP degradation.
-
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
-
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
-
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.
-
HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration
HiCrew improves long-form video question answering on EgoSchema and NExT-QA via a hybrid tree for temporal topology, question-aware captioning, and adaptive multi-agent planning, with gains in temporal and causal reasoning.
-
Video-ToC: Video Tree-of-Cue Reasoning
Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
-
SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing
SAGE is a training-free context reduction method that converts attention signals from a small LLM into a differential relevance heatmap to select top units for downstream QA, achieving competitive accuracy at 10% toke...
-
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
-
Small Vision-Language Models are Smart Compressors for Long Video Understanding
Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
-
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
-
Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs
GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
-
STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
LLaVA-Video: Video Instruction Tuning With Synthetic Data
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
-
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Response-G1 uses query-guided scene graph generation, memory retrieval, and retrieval-augmented prompting to improve proactive response timing in streaming video understanding.
-
EgoSelf: From Memory to Personalized Egocentric Assistant
EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
Reference graph
Works this paper leans on
- [1]
-
[2]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińk...
work page 2022
-
[3]
Vqa: Visual question answering
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015
work page 2015
-
[4]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint ar...
work page · Pith review · arXiv 2023
-
[5]
Longalign: A recipe for long context alignment of large language models, 2024
Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. Longalign: A recipe for long context alignment of large language models, 2024
work page 2024
-
[6]
NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023
bloc97. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023
work page 2023
-
[7]
bloc97. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation., 2024
work page 2024
-
[8]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...
work page 2020
-
[9]
Matryoshka multimodal models, 2024
Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. Matryoshka multimodal models, 2024
work page 2024
- [10]
-
[11]
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models, 2024
work page 2024
-
[12]
Sharegpt4video: Improving video understanding and generation with better captions, 2024
Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. Sharegpt4video: Improving video understanding and generation with better captions, 2024
work page 2024
-
[13]
Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023
work page · Pith review · arXiv 2023
-
[14]
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023
work page · Pith review · arXiv 2023
-
[15]
Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms, 2024
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms, 2024
work page 2024
-
[16]
Generating long sequences with sparse transformers, 2019
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019
work page 2019
-
[17]
Introducing command r+: A scalable llm built for business, 2024
Cohere. Introducing command r+: A scalable llm built for business, 2024
work page 2024
-
[18]
Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
work page 2023
-
[19]
Flashattention-2: Faster attention with better parallelism and work partitioning, 2023
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023
work page 2023
-
[20]
LongRoPE: Extending LLM context window beyond 2 million tokens
Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753, 2024
-
[21]
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2024
work page 2024
-
[22]
Data engineering for scaling language models to 128k context, 2024
Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context, 2024
work page 2024
- [23]
-
[24]
Agqa: A benchmark for compositional spatio-temporal reasoning
Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. Agqa: A benchmark for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11287–11297, 2021
work page 2021
-
[25]
Ma-lmm: Memory-augmented large multimodal model for long-term video understanding, 2024
Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding, 2024
work page 2024
-
[26]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021
work page 2021
-
[27]
Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023
work page 2023
-
[28]
Tgif-qa: Toward spatio-temporal reasoning in visual question answering
Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2758–2766, 2017
work page 2017
-
[29]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023
work page 2023
-
[30]
Chat-UniVi: Unified visual representation empowers large language models with image and video understanding
Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2023
-
[31]
Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding, 2024
work page 2024
-
[32]
A diagram is worth a dozen images, 2016
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images, 2016
work page 2016
-
[33]
Efficient memory management for large language model serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
-
[34]
OBELICS: An open web-scale filtered dataset of interleaved image-text documents
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023
work page 2023
-
[35]
Llava-next: What else influences visual instruction tuning beyond data?, May 2024
Bo Li, Hao Zhang, Kaichen Zhang, Dong Guo, Yuanhan Zhang, Renrui Zhang, Feng Li, Ziwei Liu, and Chunyuan Li. Llava-next: What else influences visual instruction tuning beyond data?, May 2024
work page 2024
-
[36]
Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024
Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024
work page 2024
-
[37]
Otter: A multi-modal model with in-context instruction tuning, 2023
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning, 2023
work page 2023
-
[38]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023
work page 2023
-
[39]
Videochat: Chat-centric video understanding, 2024
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding, 2024
work page 2024
-
[40]
Sequence parallelism: Long sequence training from system perspective
Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2391–2404, Toronto, Canada, July 2023. Association for Computational Linguistics
-
[42]
Llama-vid: An image is worth 2 tokens in large language models, 2023
Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models, 2023
work page 2023
-
[43]
Video-llava: Learning united visual representation by alignment before projection, 2023
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection, 2023
work page 2023
-
[44]
Vila: On pre-training for visual language models, 2023
Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models, 2023
work page 2023
-
[45]
World model on million-length video and language with ringattention
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint, 2024
work page 2024
-
[46]
Ring attention with blockwise transformers for near-infinite context, 2023
Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023
work page 2023
-
[47]
Improved baselines with visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023
work page 2023
-
[48]
Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
work page 2024
-
[49]
Visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023
work page 2023
-
[50]
St-llm: Large language models are effective temporal learners, 2024
Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners, 2024
work page 2024
-
[51]
Video-chatgpt: Towards detailed video understanding via large vision and language models, 2023
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models, 2023
work page 2023
-
[52]
Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023
Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023
work page 2023
-
[53]
Chartqa: A benchmark for question answering about charts with visual and logical reasoning, 2022
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning, 2022
work page 2022
-
[54]
Docvqa: A dataset for vqa on document images
Minesh Mathew, Dimosthenis Karatzas, R Manmatha, and CV Jawahar. Docvqa: A dataset for vqa on document images. arXiv preprint arXiv:2007.00398, 2020
-
[55]
Mixtral 8x22b: Cheaper, better, faster, stronger, 2024
Mistral. Mixtral 8x22b: Cheaper, better, faster, stronger, 2024
work page 2024
- [56]
-
[57]
Reka core, flash, and edge: A series of powerful multimodal language models
Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, et al. Reka core, flash, and edge: A series of powerful multimodal language models. arXiv preprint arXiv:2404.12387, 2024
-
[58]
Yarn: Efficient context window extension of large language models
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[59]
Learning transferable visual models from natural language supervision, 2021
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021
work page 2021
-
[60]
Zero: Memory optimizations toward training trillion parameter models, 2020
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models, 2020
work page 2020
-
[61]
Timechat: A time-sensitive multimodal large language model for long video understanding, 2024
Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding, 2024
work page 2024
-
[62]
Code llama: Open foundation models for code, 2024
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nico...
work page 2024
-
[63]
Llava-prumerge: Adaptive token reduction for efficient large multimodal models, 2024
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models, 2024
work page 2024
-
[64]
Milebench: Benchmarking mllms in long context, 2024
Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, and Benyou Wang. Milebench: Benchmarking mllms in long context, 2024
work page 2024
-
[65]
Moviechat: From dense token to sparse memory for long video understanding, 2024
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, and Gaoang Wang. Moviechat: From dense token to sparse memory for long video understanding, 2024
work page 2024
-
[66]
Roformer: Enhanced transformer with rotary position embedding, 2023
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023
work page 2023
-
[67]
Gemini: A family of highly capable multimodal models, 2024
Gemini Team. Gemini: A family of highly capable multimodal models, 2024
work page 2024
- [68]
- [69]
- [70]
-
[71]
Llama: Open and efficient foundation language models, 2023
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023
work page 2023
-
[72]
Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, and Hao Wang. Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models, 2024
work page 2024
-
[73]
Lvbench: An extreme long video understanding benchmark, 2024
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark, 2024
work page 2024
-
[74]
Needle in a multimodal haystack, 2024
Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, and Wenhai Wang. Needle in a multimodal haystack, 2024
work page 2024
-
[75]
Star: A benchmark for situated reasoning in real-world videos
Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021
work page 2021
- [76]
-
[77]
Next-qa: Next phase of question-answering to explaining temporal actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021
work page 2021
-
[78]
Next-qa: Next phase of question-answering to explaining temporal actions, 2021
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions, 2021
work page 2021
-
[79]
Funqa: Towards surprising video comprehension
Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, and Ziwei Liu. Funqa: Towards surprising video comprehension. GitHub repository, 2023
work page 2023
-
[80]
Effective long-context scaling of foundation models, 2023
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models, 2023
work page 2023