pith. machine review for the scientific record.

arxiv: 2406.16852 · v2 · submitted 2024-06-24 · 💻 cs.CV

Recognition: no theorem link

Long Context Transfer from Language to Vision

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 07:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords long context transfer · large multimodal models · video understanding · context extrapolation · visual tokens · needle in a haystack · long videos

The pith

Extending language model context length transfers directly to vision, letting multimodal models handle orders of magnitude more visual tokens without video training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that long-context abilities acquired purely from text sequences carry over to visual token sequences in the same transformer architecture. By extrapolating the language backbone's context window, large multimodal models gain the ability to process vastly longer videos or image sequences. The authors create the V-NIAH benchmark to measure this generalization in vision and demonstrate LongVA, which ingests 2000 frames or over 200,000 visual tokens. This yields state-of-the-art Video-MME scores among 7B-scale models through denser frame sampling. A sympathetic reader cares because the approach bypasses token-reduction resamplers and video-specific training.

Core claim

By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer. The resulting Long Video Assistant processes 2000 frames or over 200K visual tokens and reaches state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames.

What carries the argument

Long context transfer: the direct application of language-model context extrapolation to sequences of visual tokens inside the shared transformer.
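The extrapolation mechanism the referee report names is RoPE scaling. A minimal sketch of the NTK-aware variant of rotary-frequency adjustment (the formula follows the bloc97 NTK-aware recipe cited by the paper; the `scale` knob and head dimension are illustrative, not the paper's exact configuration):

```python
def rope_frequencies(head_dim, base=10000.0, scale=1.0):
    # NTK-aware scaling: raise the RoPE base so positions far beyond the
    # training window reuse the frequency band seen during training.
    # `scale` is the target/train context-length ratio (assumed knob).
    adjusted_base = base * scale ** (head_dim / (head_dim - 2))
    return [adjusted_base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rotary_angle(position, inv_freq):
    # One rotation angle per frequency at a given token position. Visual
    # tokens occupy the same positional axis as text tokens, which is what
    # lets text-trained extrapolation carry over to vision.
    return [position * f for f in inv_freq]
```

With `scale > 1`, every nonzero frequency shrinks, so a position past the original window rotates no faster than an in-window position did at training time.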

Load-bearing premise

Long-context capabilities learned purely from text sequences transfer effectively and without significant degradation to sequences of visual tokens in the same transformer architecture.

What would settle it

If a model whose context length has been extrapolated shows no improvement over a standard-context model when both are tested on long visual sequences in V-NIAH, the transfer claim is falsified.
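The settling experiment can be sketched as a protocol (illustrative only: the frame placeholders and the depth grid are hypothetical stand-ins for the paper's exact V-NIAH construction):

```python
def insert_needle(haystack_frames, needle_frame, depth):
    # Place the needle frame at a relative depth within the haystack,
    # mirroring the language-model NIAH protocol.
    idx = int(depth * len(haystack_frames))
    return haystack_frames[:idx] + [needle_frame] + haystack_frames[idx:]

def vniah_cells(frame_counts, depths):
    # Each (context length, needle depth) pair is one heatmap cell; the
    # transfer claim predicts the extrapolated model stays accurate in
    # cells beyond the base model's context window, while the
    # standard-context model does not.
    return [(n, d) for n in frame_counts for d in depths]
```

Falsification then amounts to comparing retrieval accuracy of the two models cell by cell across the grid.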

read the original abstract

Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that long-context capabilities acquired by the language model backbone of LMMs transfer to the vision modality, allowing models to process orders of magnitude more visual tokens (e.g., 200K tokens from 2000 video frames) without any video training. This 'long context transfer' is demonstrated via context-length extrapolation on the LM (e.g., RoPE scaling), supported by ablations, the new synthetic V-NIAH benchmark for long visual retrieval, and SOTA results on Video-MME among 7B-scale models using their LongVA model that densely samples frames.

Significance. If the transfer holds after proper controls, the result is significant: it offers a simple, training-free way to scale LMMs to long videos by reusing text long-context techniques, avoiding expensive video data collection and multimodal long-context pretraining. The V-NIAH benchmark provides a controlled, synthetic testbed for vision long-context generalization, and open-sourcing the model and code strengthens the contribution for the community.

major comments (3)
  1. [§3, §4.1] §3 (Method) and §4.1 (V-NIAH): The central transfer claim requires that text-derived positional extrapolation generalizes to visual token sequences, yet no analysis isolates whether observed long-context behavior on V-NIAH stems from the LM backbone's text pretraining versus incidental effects of the multimodal pretraining on the shared transformer; a control experiment using a short-context LM backbone fine-tuned only on short multimodal data is needed to establish causality.
  2. [§5.2, Table 2] §5.2 (Ablations) and Table 2: The reported gains on Video-MME from increasing frame count rely on dense sampling enabled by extended context, but without reporting the performance curve versus context length or error analysis (e.g., attention collapse or information loss) for sequences beyond the original training length, it is unclear whether degradation occurs or if the transfer is lossless as claimed.
  3. [§4.1] §4.1 (V-NIAH construction): The synthetic benchmark inserts 'needles' into visual 'haystacks,' but the paper does not specify how the visual token embeddings for needles and haystacks are sampled from the vision encoder's output distribution; without matching real video statistics or providing a text-to-vision token distribution comparison, the benchmark may overestimate transferability.
minor comments (3)
  1. [Figure 1] Figure 1: The diagram of token flow could clarify the exact point at which context extrapolation is applied (before or after vision encoder projection) and include a quantitative comparison of token counts.
  2. [Abstract, §1] Abstract and §1: The phrase 'orders of magnitude more visual tokens' should be quantified with the precise scaling factor achieved relative to the base model's context length.
  3. [Related Work] Related work section: A few recent long-video LMM papers using resamplers or memory mechanisms are not cited; adding them would better position the contribution.
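On minor comment 2, the abstract's own numbers already imply the scale: 2000 frames mapping to over 200K visual tokens is roughly a hundred tokens per frame. A back-of-envelope check (the per-frame figure is an assumed round number, not the paper's exact projector output):

```python
def visual_token_budget(num_frames, tokens_per_frame=100):
    # 2000 frames at ~100 tokens/frame crosses the 200K-token mark
    # quoted in the abstract (tokens_per_frame is illustrative).
    return num_frames * tokens_per_frame
```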

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and indicate the revisions we will make to improve the paper.

read point-by-point responses
  1. Referee: [§3, §4.1] §3 (Method) and §4.1 (V-NIAH): The central transfer claim requires that text-derived positional extrapolation generalizes to visual token sequences, yet no analysis isolates whether observed long-context behavior on V-NIAH stems from the LM backbone's text pretraining versus incidental effects of the multimodal pretraining on the shared transformer; a control experiment using a short-context LM backbone fine-tuned only on short multimodal data is needed to establish causality.

    Authors: We agree that a control experiment with a short-context LM backbone would help isolate the contribution of text pretraining to the observed transfer. Our current setup uses an LM backbone pretrained on long text contexts, with subsequent multimodal training on shorter sequences. The scaling behaviors in our ablations support that the long-context capabilities originate from the LM component. However, conducting the suggested control would require substantial additional compute for retraining. We will add a clarification in the method section and a discussion in the limitations regarding this aspect. revision: partial

  2. Referee: [§5.2, Table 2] §5.2 (Ablations) and Table 2: The reported gains on Video-MME from increasing frame count rely on dense sampling enabled by extended context, but without reporting the performance curve versus context length or error analysis (e.g., attention collapse or information loss) for sequences beyond the original training length, it is unclear whether degradation occurs or if the transfer is lossless as claimed.

    Authors: We have conducted experiments varying the context length and will include a performance curve versus context length in the revised version of Table 2 or as a new figure in Section 5.2. For error analysis, our results indicate stable performance without notable degradation or attention collapse up to the tested lengths; we will incorporate a brief analysis of attention maps and potential information loss in the ablations section. revision: yes

  3. Referee: [§4.1] §4.1 (V-NIAH construction): The synthetic benchmark inserts 'needles' into visual 'haystacks,' but the paper does not specify how the visual token embeddings for needles and haystacks are sampled from the vision encoder's output distribution; without matching real video statistics or providing a text-to-vision token distribution comparison, the benchmark may overestimate transferability.

    Authors: We will revise Section 4.1 to specify the construction details: the haystack frames are sampled from a diverse set of videos, and needle frames are specific target images processed through the same vision encoder. The token embeddings are the direct outputs of the vision encoder without additional sampling modifications. We acknowledge the lack of explicit distribution matching and will add a comparison note or discussion on how the synthetic setup relates to real video token statistics to address potential overestimation concerns. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical extrapolation validated on benchmarks

full rationale

The paper's core claim—that extrapolating the language backbone's context length enables LMMs to handle far more visual tokens without video-specific training—is presented as an observed empirical phenomenon, not a mathematical derivation. It is supported by direct experiments on the introduced V-NIAH synthetic benchmark and Video-MME, with ablations showing behavior under extended contexts. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain. The 'long context transfer' label is merely descriptive naming of the observed transfer effect, and the work remains self-contained against external benchmarks rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical transfer of context extrapolation from text to vision tokens; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Transformer attention mechanisms can extrapolate to context lengths beyond those seen during training when positional encodings allow it.
    Invoked when the authors extrapolate the language backbone context length.

pith-pipeline@v0.9.0 · 5529 in / 1145 out tokens · 47961 ms · 2026-05-12T07:00:08.278129+00:00 · methodology

discussion (0)


Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

    cs.CV 2026-05 unverdicted novelty 8.0

    EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

  2. MedHorizon: Towards Long-context Medical Video Understanding in the Wild

    cs.CV 2026-05 unverdicted novelty 8.0

    MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

  3. VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

    cs.CV 2026-05 unverdicted novelty 8.0

    VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.

  4. ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.

  5. MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    MMVIAD is the first multi-view continuous video dataset for industrial anomaly detection with four supported tasks, and the VISTA model improves average benchmark scores from 45.0 to 57.5 on unseen data while surpassi...

  6. Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.

  7. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.

  8. VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.

  9. MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding

    cs.MM 2026-04 unverdicted novelty 7.0

    MarkIt uses a query-to-mask bridge with open-vocabulary segmentation to add visual markers and frame indices to videos, enabling Vid-LLMs to achieve state-of-the-art temporal grounding on moment retrieval and highligh...

  10. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  11. OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

  12. Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging

    cs.CV 2026-04 unverdicted novelty 7.0

    MERIT restores temporal reasoning in VLMs via layer-selective self-attention merging guided by a TR-improving objective that penalizes TP degradation.

  13. AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.

  14. Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

    cs.CV 2026-03 unverdicted novelty 7.0

    SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

  15. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  16. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    cs.CV 2025-01 unverdicted novelty 7.0

    Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

  17. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  18. Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

    cs.CV 2026-05 unverdicted novelty 6.0

    Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

  19. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  20. Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.

  21. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.

  22. Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.

  23. HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

    cs.AI 2026-04 unverdicted novelty 6.0

    HiCrew improves long-form video question answering on EgoSchema and NExT-QA via a hybrid tree for temporal topology, question-aware captioning, and adaptive multi-agent planning, with gains in temporal and causal reasoning.

  24. Video-ToC: Video Tree-of-Cue Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

  25. SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing

    cs.DB 2026-04 unverdicted novelty 6.0

    SAGE is a training-free context reduction method that converts attention signals from a small LLM into a differential relevance heatmap to select top units for downstream QA, achieving competitive accuracy at 10% toke...

  26. One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

  27. Small Vision-Language Models are Smart Compressors for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.

  28. HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.

  29. Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.

  30. STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

    cs.CV 2026-04 unverdicted novelty 6.0

    STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.

  31. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  32. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  33. Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 5.0

    Response-G1 uses query-guided scene graph generation, memory retrieval, and retrieval-augmented prompting to improve proactive response timing in streaming video understanding.

  34. EgoSelf: From Memory to Personalized Egocentric Assistant

    cs.CV 2026-04 unverdicted novelty 5.0

    EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.

  35. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  36. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  37. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 34 Pith papers · 3 internal anchors

  1. [1]

    Llm testneedleinahaystack, 2023

    Arize AI. Llm testneedleinahaystack, 2023

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikoł aj Bi´nk...

  3. [3]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015

  4. [4]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open- source framework for training large autoregressive vision-language models. arXiv preprint ar...

  5. [5]

    Longalign: A recipe for long context alignment of large language models, 2024

    Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. Longalign: A recipe for long context alignment of large language models, 2024

  6. [6]

    ntkaware scaled rope allows llama models to have, 2023

    bloc97. ntkaware scaled rope allows llama models to have, 2023

  7. [7]

    Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation., 2024

    bloc97. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation., 2024

  8. [8]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

  9. [9]

    Matryoshka multimodal models, 2024

    Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. Matryoshka multimodal models, 2024

  10. [10]

    cerebras slimpajama-627b, 2023

    Cerebras. cerebras slimpajama-627b, 2023

  11. [11]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models, 2024

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models, 2024

  12. [12]

    Sharegpt4video: Improving video understanding and generation with better captions, 2024

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. Sharegpt4video: Improving video understanding and generation with better captions, 2024

  13. [13]

    Extending Context Window of Large Language Models via Positional Interpolation

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023. 10

  14. [14]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023

  15. [15]

    Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms, 2024

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms, 2024

  16. [16]

    Generating long sequences with sparse transformers, 2019

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019

  17. [17]

    Introducing command r+: A scalable llm built for business, 2024

    Cohere. Introducing command r+: A scalable llm built for business, 2024

  18. [18]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

  19. [19]

    Flashattention-2: Faster attention with better parallelism and work partitioning, 2023

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023

  20. [20]

    LongRoPE: Extending LLM context window beyond 2 million tokens

    Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens.arXiv preprint arXiv:2402.13753, 2024

  21. [21]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2024

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2024

  22. [22]

    Data engineering for scaling language models to 128k context, 2024

    Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context, 2024

  23. [23]

    Llmtest needleinahaystack, 2024

    Kamradt Gregory. Llmtest needleinahaystack, 2024

  24. [24]

    Agqa: A benchmark for compositional spatio-temporal reasoning

    Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. Agqa: A benchmark for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11287–11297, 2021

  25. [25]

    Ma-lmm: Memory-augmented large multimodal model for long-term video understanding, 2024

    Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding, 2024

  26. [26]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021

  27. [27]

    Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023

  28. [28]

    Tgif-qa: Toward spatio-temporal reasoning in visual question answering

    Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2758–2766, 2017

  29. [29]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023

  30. [30]

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2023

  31. [31]

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding, 2024

  32. [32]

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images, 2016

  33. [33]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  34. [34]

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023

  35. [35]

    Bo Li, Hao Zhang, Kaichen Zhang, Dong Guo, Yuanhan Zhang, Renrui Zhang, Feng Li, Ziwei Liu, and Chunyuan Li. Llava-next: What else influences visual instruction tuning beyond data?, May 2024

  36. [36]

    Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024

  37. [37]

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning, 2023

  38. [38]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023

  39. [39]

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding, 2024

  40. [40]

    Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2391–2404, Toronto, Canada, July 2023. Association for Computational Linguistics

  42. [42]

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models, 2023

  43. [43]

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection, 2023

  44. [44]

    Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models, 2023

  45. [45]

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint, 2024

  46. [46]

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023

  47. [47]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023

  48. [48]

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

  49. [49]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023

  50. [50]

    Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners, 2024

  51. [51]

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models, 2023

  52. [52]

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023

  53. [53]

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning, 2022

  54. [54]

    Minesh Mathew, Dimosthenis Karatzas, R Manmatha, and CV Jawahar. Docvqa: A dataset for vqa on document images. arXiv preprint arXiv:2007.00398, 2020

  55. [55]

    Mistral. Mixtral 8x22b: Cheaper, better, faster, stronger, 2024

  56. [56]

    OpenAI. Hello gpt-4o, 2024

  57. [57]

    Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, et al. Reka core, flash, and edge: A series of powerful multimodal language models. arXiv preprint arXiv:2404.12387, 2024

  58. [58]

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2023

  59. [59]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021

  60. [60]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models, 2020

  61. [61]

    Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding, 2024

  62. [62]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nico...

  63. [63]

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models, 2024

  64. [64]

    Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, and Benyou Wang. Milebench: Benchmarking mllms in long context, 2024

  65. [65]

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, and Gaoang Wang. Moviechat: From dense token to sparse memory for long video understanding, 2024

  66. [66]

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023

  67. [67]

    Gemini Team. Gemini: A family of highly capable multimodal models, 2024

  68. [68]

    PaLM Team. Palm 2 technical report, 2023

  69. [69]

    Qwen Team. Introducing qwen-vl, 2024

  70. [70]

    Qwen2 Team. Qwen2 technical report, 2024

  71. [71]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023

  72. [72]

    Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, and Hao Wang. Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models, 2024

  73. [73]

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark, 2024

  74. [74]

    Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, and Wenhai Wang. Needle in a multimodal haystack, 2024

  75. [75]

    Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

  76. [76]

    xAI. Grok-1.5 vision preview, April 2024

  77. [77]

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021

  78. [78]

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions, 2021

  79. [79]

    Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, and Ziwei Liu. Funqa: Towards surprising video comprehension. GitHub repository, 2023

  80. [80]

    Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models, 2023
