Recognition: no theorem link
Long Context Transfer from Language to Vision
Pith reviewed 2026-05-12 07:00 UTC · model grok-4.3
The pith
Extending language model context length transfers directly to vision, letting multimodal models handle orders of magnitude more visual tokens without video training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer. The resulting Long Video Assistant processes 2000 frames or over 200K visual tokens and reaches state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames.
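As a back-of-envelope check on those figures, the sketch below assumes roughly 144 visual tokens per frame after pooling; that per-frame budget is an assumption used for illustration here, not a number quoted on this page.

```python
# Back-of-envelope check of the "2000 frames or over 200K visual tokens" figure.
# TOKENS_PER_FRAME is an assumed value (pooled vision-encoder patch tokens),
# not taken from this review.
TOKENS_PER_FRAME = 144
FRAMES = 2000

visual_tokens = TOKENS_PER_FRAME * FRAMES
print(visual_tokens)  # 288000, i.e. comfortably "over 200K visual tokens"
```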
What carries the argument
Long context transfer: the direct application of language-model context extrapolation to sequences of visual tokens inside the shared transformer.
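A minimal sketch of what that application looks like at inference time; every name here (vision_encoder, projector, lm) is a hypothetical placeholder rather than the paper's API, and the flow is an illustration of the claimed mechanism, not the authors' implementation.

```python
import torch

def answer_about_video(frames, question_ids, vision_encoder, projector, lm):
    """Illustrative sketch of long context transfer at inference time.

    The only long-context machinery lives inside `lm`, whose context window
    was extended on text alone; the visual tokens simply ride on it.
    """
    vis_feats = vision_encoder(frames)                     # (n_frames, tok_per_frame, d_vis)
    vis_embeds = projector(vis_feats).flatten(0, 1)        # (n_frames * tok_per_frame, d_model)

    txt_embeds = lm.get_input_embeddings()(question_ids)   # (n_text, d_model)
    inputs = torch.cat([vis_embeds, txt_embeds], dim=0).unsqueeze(0)

    # The concatenated sequence may run to ~200K tokens; no video-specific
    # training is assumed, only the extrapolated context length of the backbone.
    return lm.generate(inputs_embeds=inputs, max_new_tokens=64)
```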
Load-bearing premise
Long-context capabilities learned purely from text sequences transfer effectively and without significant degradation to sequences of visual tokens in the same transformer architecture.
What would settle it
If a model whose context length has been extrapolated shows no improvement over a standard-context model when both are tested on long visual sequences in V-NIAH, the transfer claim is falsified.
Original abstract
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that long-context capabilities acquired by the language model backbone of LMMs transfer to the vision modality, allowing models to process orders of magnitude more visual tokens (e.g., 200K tokens from 2000 video frames) without any video training. This 'long context transfer' is demonstrated via context-length extrapolation on the LM (e.g., RoPE scaling), supported by ablations, the new synthetic V-NIAH benchmark for long visual retrieval, and SOTA results on Video-MME among 7B-scale models using their LongVA model that densely samples frames.
Significance. If the transfer holds after proper controls, the result is significant: it offers a simple, training-free way to scale LMMs to long videos by reusing text long-context techniques, avoiding expensive video data collection and multimodal long-context pretraining. The V-NIAH benchmark provides a controlled, synthetic testbed for vision long-context generalization, and open-sourcing the model and code strengthens the contribution for the community.
major comments (3)
- [§3, §4.1] §3 (Method) and §4.1 (V-NIAH): The central transfer claim requires that text-derived positional extrapolation generalizes to visual token sequences, yet no analysis isolates whether observed long-context behavior on V-NIAH stems from the LM backbone's text pretraining versus incidental effects of the multimodal pretraining on the shared transformer; a control experiment using a short-context LM backbone fine-tuned only on short multimodal data is needed to establish causality.
- [§5.2, Table 2] §5.2 (Ablations) and Table 2: The reported gains on Video-MME from increasing frame count rely on dense sampling enabled by extended context, but without reporting the performance curve versus context length or error analysis (e.g., attention collapse or information loss) for sequences beyond the original training length, it is unclear whether degradation occurs or if the transfer is lossless as claimed.
- [§4.1] §4.1 (V-NIAH construction): The synthetic benchmark inserts 'needles' into visual 'haystacks,' but the paper does not specify how the visual token embeddings for needles and haystacks are sampled from the vision encoder's output distribution; without matching real video statistics or providing a text-to-vision token distribution comparison, the benchmark may overestimate transferability.
minor comments (3)
- [Figure 1] Figure 1: The diagram of token flow could clarify the exact point at which context extrapolation is applied (before or after vision encoder projection) and include a quantitative comparison of token counts.
- [Abstract, §1] Abstract and §1: The phrase 'orders of magnitude more visual tokens' should be quantified with the precise scaling factor achieved relative to the base model's context length.
- [Related Work] Related work section: A few recent long-video LMM papers using resamplers or memory mechanisms are not cited; adding them would better position the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and indicate the revisions we will make to improve the paper.
Point-by-point responses
-
Referee: [§3, §4.1] §3 (Method) and §4.1 (V-NIAH): The central transfer claim requires that text-derived positional extrapolation generalizes to visual token sequences, yet no analysis isolates whether observed long-context behavior on V-NIAH stems from the LM backbone's text pretraining versus incidental effects of the multimodal pretraining on the shared transformer; a control experiment using a short-context LM backbone fine-tuned only on short multimodal data is needed to establish causality.
Authors: We agree that a control experiment with a short-context LM backbone would help isolate the contribution of text pretraining to the observed transfer. Our current setup uses an LM backbone pretrained on long text contexts, with subsequent multimodal training on shorter sequences. The scaling behaviors in our ablations support that the long-context capabilities originate from the LM component. However, conducting the suggested control would require substantial additional compute for retraining. We will add a clarification in the method section and a discussion in the limitations regarding this aspect. revision: partial
-
Referee: [§5.2, Table 2] §5.2 (Ablations) and Table 2: The reported gains on Video-MME from increasing frame count rely on dense sampling enabled by extended context, but without reporting the performance curve versus context length or error analysis (e.g., attention collapse or information loss) for sequences beyond the original training length, it is unclear whether degradation occurs or if the transfer is lossless as claimed.
Authors: We have conducted experiments varying the context length and will include a performance curve versus context length in the revised version of Table 2 or as a new figure in Section 5.2. For error analysis, our results indicate stable performance without notable degradation or attention collapse up to the tested lengths; we will incorporate a brief analysis of attention maps and potential information loss in the ablations section. revision: yes
-
Referee: [§4.1] §4.1 (V-NIAH construction): The synthetic benchmark inserts 'needles' into visual 'haystacks,' but the paper does not specify how the visual token embeddings for needles and haystacks are sampled from the vision encoder's output distribution; without matching real video statistics or providing a text-to-vision token distribution comparison, the benchmark may overestimate transferability.
Authors: We will revise Section 4.1 to specify the construction details: the haystack frames are sampled from a diverse set of videos, and needle frames are specific target images processed through the same vision encoder. The token embeddings are the direct outputs of the vision encoder without additional sampling modifications. We acknowledge the lack of explicit distribution matching and will add a comparison note or discussion on how the synthetic setup relates to real video token statistics to address potential overestimation concerns. revision: yes
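To make the benchmark's construction concrete, the sketch below assembles one V-NIAH-style sample: a needle frame placed at a controlled depth inside a haystack of distractor frames, paired with a question answerable only from the needle. The helper names and sampling choices are illustrative assumptions, not the authors' released code.

```python
import random

def build_vniah_sample(haystack_frames, needle_frame, needle_question,
                       total_frames, depth_fraction):
    """Assemble one synthetic long-vision retrieval sample (illustrative sketch).

    haystack_frames: pool of distractor frames drawn from ordinary videos
    needle_frame:    a single target image the question can only be answered from
    depth_fraction:  where the needle is placed (0.0 = start, 1.0 = end)
    """
    frames = random.choices(haystack_frames, k=total_frames - 1)
    insert_at = int(depth_fraction * (total_frames - 1))
    frames.insert(insert_at, needle_frame)
    return {"frames": frames, "question": needle_question, "needle_index": insert_at}

# Sweeping sequence length and needle depth yields the usual NIAH-style heatmap:
# samples = [build_vniah_sample(haystack, needle, q, n, d)
#            for n in (200, 500, 1000, 2000) for d in (0.0, 0.25, 0.5, 0.75, 1.0)]
```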
Circularity Check
No circularity: empirical extrapolation validated on benchmarks
Full rationale
The paper's core claim—that extrapolating the language backbone's context length enables LMMs to handle far more visual tokens without video-specific training—is presented as an observed empirical phenomenon, not a mathematical derivation. It is supported by direct experiments on the introduced V-NIAH synthetic benchmark and Video-MME, with ablations showing behavior under extended contexts. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain. The 'long context transfer' label is merely descriptive naming of the observed transfer effect, and the work remains self-contained against external benchmarks rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Transformer attention mechanisms can extrapolate to context lengths beyond those seen during training when positional encodings allow it.
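One common way this assumption is operationalized is positional interpolation over rotary embeddings, in the spirit of the positional-interpolation and NTK-aware scaling works cited below; the sketch shows the linear variant, with the scale factor chosen for illustration rather than taken from the paper.

```python
import torch

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary-embedding angles with optional linear positional interpolation.

    scale > 1 compresses positions so a context `scale` times longer than the
    training length reuses the angle range the model saw during training.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    pos = positions.float() / scale            # linear positional interpolation
    return torch.outer(pos, inv_freq)          # (seq_len, dim // 2)

# Trained at 4K, run at 32K: scale = 32768 / 4096 = 8 keeps angles in-distribution.
angles = rope_angles(torch.arange(32768), dim=128, scale=8.0)
```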
Forward citations
Cited by 37 Pith papers
-
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
-
MedHorizon: Towards Long-context Medical Video Understanding in the Wild
MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
-
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
-
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
-
MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
MMVIAD is the first multi-view continuous video dataset for industrial anomaly detection with four supported tasks, and the VISTA model improves average benchmark scores from 45.0 to 57.5 on unseen data while surpassi...
-
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
-
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
-
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
-
MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
MarkIt uses a query-to-mask bridge with open-vocabulary segmentation to add visual markers and frame indices to videos, enabling Vid-LLMs to achieve state-of-the-art temporal grounding on moment retrieval and highligh...
-
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
-
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
-
Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging
MERIT restores temporal reasoning in VLMs via layer-selective self-attention merging guided by a TR-improving objective that penalizes TP degradation.
-
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
-
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
-
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.
-
HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration
HiCrew improves long-form video question answering on EgoSchema and NExT-QA via a hybrid tree for temporal topology, question-aware captioning, and adaptive multi-agent planning, with gains in temporal and causal reasoning.
-
Video-ToC: Video Tree-of-Cue Reasoning
Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
-
SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing
SAGE is a training-free context reduction method that converts attention signals from a small LLM into a differential relevance heatmap to select top units for downstream QA, achieving competitive accuracy at 10% toke...
-
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
-
Small Vision-Language Models are Smart Compressors for Long Video Understanding
Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
-
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
-
Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs
GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
-
STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
LLaVA-Video: Video Instruction Tuning With Synthetic Data
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
-
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Response-G1 uses query-guided scene graph generation, memory retrieval, and retrieval-augmented prompting to improve proactive response timing in streaming video understanding.
-
EgoSelf: From Memory to Personalized Egocentric Assistant
EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
Reference graph
Works this paper leans on
- [1]
-
[2]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińk...
work page 2022
-
[3]
Vqa: Visual question answering
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015
work page 2015
-
[4]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint ar...
work page · Pith review · arXiv 2023
-
[5]
Longalign: A recipe for long context alignment of large language models, 2024
Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. Longalign: A recipe for long context alignment of large language models, 2024
work page 2024
-
[6]
NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023
bloc97. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023
work page 2023
-
[7]
bloc97. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation., 2024
work page 2024
-
[8]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...
work page 2020
-
[9]
Matryoshka multimodal models, 2024
Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. Matryoshka multimodal models, 2024
work page 2024
- [10]
-
[11]
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models, 2024
work page 2024
-
[12]
Sharegpt4video: Improving video understanding and generation with better captions, 2024
Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. Sharegpt4video: Improving video understanding and generation with better captions, 2024
work page 2024
-
[13]
Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023
work page · Pith review · arXiv 2023
-
[14]
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023
work page · Pith review · arXiv 2023
-
[15]
Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms, 2024
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms, 2024
work page 2024
-
[16]
Generating long sequences with sparse transformers, 2019
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019
work page 2019
-
[17]
Introducing command r+: A scalable llm built for business, 2024
Cohere. Introducing command r+: A scalable llm built for business, 2024
work page 2024
-
[18]
Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
work page 2023
-
[19]
Flashattention-2: Faster attention with better parallelism and work partitioning, 2023
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023
work page 2023
-
[20]
LongRoPE: Extending LLM context window beyond 2 million tokens
Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753, 2024
-
[21]
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2024
work page 2024
-
[22]
Data engineering for scaling language models to 128k context, 2024
Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context, 2024
work page 2024
- [23]
-
[24]
Agqa: A benchmark for compositional spatio-temporal reasoning
Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. Agqa: A benchmark for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11287–11297, 2021
work page 2021
-
[25]
Ma-lmm: Memory-augmented large multimodal model for long-term video understanding, 2024
Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding, 2024
work page 2024
-
[26]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021
work page 2021
-
[27]
Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023
work page 2023
-
[28]
Tgif-qa: Toward spatio-temporal reasoning in visual question answering
Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2758–2766, 2017
work page 2017
-
[29]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023
work page 2023
-
[30]
Chat-UniVi: Unified visual representation empowers large language models with image and video understanding
Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2023
-
[31]
Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding, 2024
work page 2024
-
[32]
A diagram is worth a dozen images, 2016
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images, 2016
work page 2016
-
[33]
Efficient memory management for large language model serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
-
[34]
OBELICS: An open web-scale filtered dataset of interleaved image-text documents
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023
work page 2023
-
[35]
Llava-next: What else influences visual instruction tuning beyond data?, May 2024
Bo Li, Hao Zhang, Kaichen Zhang, Dong Guo, Yuanhan Zhang, Renrui Zhang, Feng Li, Ziwei Liu, and Chunyuan Li. Llava-next: What else influences visual instruction tuning beyond data?, May 2024
work page 2024
-
[36]
Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024
Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024
work page 2024
-
[37]
Otter: A multi-modal model with in-context instruction tuning, 2023
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning, 2023
work page 2023
-
[38]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023
work page 2023
-
[39]
Videochat: Chat-centric video understanding, 2024
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding, 2024
work page 2024
-
[40]
Sequence parallelism: Long sequence training from system perspective
Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2391–2404, Toronto, Canada, July 2023. Association for Computational Linguistics
-
[42]
Llama-vid: An image is worth 2 tokens in large language models, 2023
Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models, 2023
work page 2023
-
[43]
Video-llava: Learning united visual representation by alignment before projection, 2023
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection, 2023
work page 2023
-
[44]
Vila: On pre-training for visual language models, 2023
Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models, 2023
work page 2023
-
[45]
World model on million-length video and language with ringattention
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint, 2024
work page 2024
-
[46]
Ring attention with blockwise transformers for near-infinite context, 2023
Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023
work page 2023
-
[47]
Improved baselines with visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023
work page 2023
-
[48]
Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
work page 2024
-
[49]
Visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023
work page 2023
-
[50]
St-llm: Large language models are effective temporal learners, 2024
Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners, 2024
work page 2024
-
[51]
Video-chatgpt: Towards detailed video understanding via large vision and language models, 2023
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models, 2023
work page 2023
-
[52]
Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023
Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023
work page 2023
-
[53]
Chartqa: A benchmark for question answering about charts with visual and logical reasoning, 2022
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning, 2022
work page 2022
-
[54]
Docvqa: A dataset for vqa on document images
Minesh Mathew, Dimosthenis Karatzas, R Manmatha, and CV Jawahar. Docvqa: A dataset for vqa on document images. arXiv preprint arXiv:2007.00398, 2020
-
[55]
Mixtral 8x22b: Cheaper, better, faster, stronger, 2024
Mistral. Mixtral 8x22b: Cheaper, better, faster, stronger, 2024
work page 2024
- [56]
-
[57]
Reka core, flash, and edge: A series of powerful multimodal language models
Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, et al. Reka core, flash, and edge: A series of powerful multimodal language models. arXiv preprint arXiv:2404.12387, 2024
-
[58]
Yarn: Efficient context window extension of large language models
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[59]
Learning transferable visual models from natural language supervision, 2021
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021
work page 2021
-
[60]
Zero: Memory optimizations toward training trillion parameter models, 2020
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models, 2020
work page 2020
-
[61]
Timechat: A time-sensitive multimodal large language model for long video understanding, 2024
Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding, 2024
work page 2024
-
[62]
Code llama: Open foundation models for code, 2024
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nico...
work page 2024
-
[63]
Llava-prumerge: Adaptive token reduction for efficient large multimodal models, 2024
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models, 2024
work page 2024
-
[64]
Milebench: Benchmarking mllms in long context, 2024
Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, and Benyou Wang. Milebench: Benchmarking mllms in long context, 2024
work page 2024
-
[65]
Moviechat: From dense token to sparse memory for long video understanding, 2024
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, and Gaoang Wang. Moviechat: From dense token to sparse memory for long video understanding, 2024
work page 2024
-
[66]
Roformer: Enhanced transformer with rotary position embedding, 2023
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023
work page 2023
-
[67]
Gemini: A family of highly capable multimodal models, 2024
Gemini Team. Gemini: A family of highly capable multimodal models, 2024
work page 2024
- [68]
- [69]
- [70]
-
[71]
Llama: Open and efficient foundation language models, 2023
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023
work page 2023
-
[72]
Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, and Hao Wang. Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models, 2024
work page 2024
-
[73]
Lvbench: An extreme long video understanding benchmark, 2024
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark, 2024
work page 2024
-
[74]
Needle in a multimodal haystack, 2024
Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, and Wenhai Wang. Needle in a multimodal haystack, 2024
work page 2024
-
[75]
Star: A benchmark for situated reasoning in real-world videos
Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021
work page 2021
- [76]
-
[77]
Next-qa: Next phase of question-answering to explaining temporal actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021
work page 2021
-
[78]
Next-qa: Next phase of question-answering to explaining temporal actions, 2021
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions, 2021
work page 2021
-
[79]
Funqa: Towards surprising video comprehension
Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, and Ziwei Liu. Funqa: Towards surprising video comprehension. GitHub repository, 2023
work page 2023
-
[80]
Effective long-context scaling of foundation models, 2023
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models, 2023
work page 2023