Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Pith reviewed 2026-05-14 19:12 UTC · model grok-4.3
The pith
Balanced long-document VQA data lets a vision-language model trained to 128K generalize to 512K contexts without further training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Continued pre-training on a balanced mixture of long-document VQA data enables a 7B LVLM to reach 128K context while generalizing to 256K and 512K contexts, delivering a 7.1 percent gain on long-document VQA benchmarks. Ablations establish that VQA outperforms OCR, balanced length sampling beats target-length concentration, retrieval remains the dominant bottleneck, and instruction-formatted long data largely preserves short-context capabilities without requiring short-data mixing.
What carries the argument
The long-context continued pre-training recipe built around a balanced sequence-length distribution of long-document VQA data.
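A minimal sketch of the contrast at stake, not the paper's released code: how a balanced sequence-length distribution differs from a target-length-focused one when packing long-document VQA samples. Bucket edges, sampling weights, and field names are illustrative assumptions.

```python
# Illustrative sketch only: contrast a balanced sequence-length distribution
# with a target-length-focused one when packing long-document VQA samples.
# Bucket edges and weights are assumptions, not the paper's actual values.
import random

BUCKETS = [8_192, 16_384, 32_768, 65_536, 131_072]  # packed-length targets (tokens)

def sample_target_length(strategy: str) -> int:
    """Pick a packed-sequence length for one training example."""
    if strategy == "balanced":
        # Roughly uniform mass across buckets, so key information appears
        # at many lengths and positions -- the setting the ablations favor.
        return random.choice(BUCKETS)
    if strategy == "target_focused":
        # Concentrate most mass at the 128K training window, the weaker
        # alternative in the reported ablation.
        return random.choices(BUCKETS, weights=[1, 1, 1, 2, 20])[0]
    raise ValueError(f"unknown strategy: {strategy}")

def pack_vqa_sequence(doc_pages: list, qa_pairs: list, target_len: int) -> dict:
    """Greedily concatenate document pages up to the token budget, then
    append the VQA turns (hypothetical schema with a 'num_tokens' field)."""
    packed, used = [], 0
    for page in doc_pages:
        if used + page["num_tokens"] > target_len:
            break
        packed.append(page)
        used += page["num_tokens"]
    return {"pages": packed, "qa": qa_pairs, "length": used}
```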
If this is right
- Long-document VQA accuracy rises by 7.1 percent after 128K training.
- Strong performance holds at 256K and 512K evaluation lengths without additional training.
- The model transfers to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision.
- Short-context capabilities remain intact when training uses only instruction-formatted long data.
- Retrieval-heavy mixtures with modest reasoning data support task diversity while minimizing data volume.
Where Pith is reading between the lines
- The same balanced VQA recipe could be tested on larger base models or different vision encoders to check whether the generalization pattern scales.
- Reducing reliance on short-context data mixtures simplifies the overall training pipeline for long-context LVLMs.
- Extending the balanced distribution to even longer target windows (such as 1M tokens) might further push the generalization boundary observed here.
Load-bearing premise
The performance gains and generalization beyond the training window arise primarily from the long-document VQA mixture and its balanced length distribution rather than from the base model, total token count, or other unablated factors.
What would settle it
Training an identical base model on the same 5B-token budget but with only OCR data or with length distributions heavily skewed toward 128K, then measuring whether long-document VQA scores and 512K generalization match or fall short of the reported results.
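As a hypothetical sketch of how that settling experiment could be organized: the three arms below share the base model, training window, and 5B-token budget and vary only the data axis under test. The configuration fields and metric names are illustrative, not the paper's schema.

```python
# Hypothetical layout of the settling ablation: identical base model and
# token budget, varying only the data mixture or length distribution.
SHARED = {"base_model": "Qwen2.5-VL-7B", "token_budget": 5_000_000_000,
          "train_window": 131_072}

ABLATION_ARMS = {
    "reported":      {"data": "long-document VQA", "length_dist": "balanced"},
    "ocr_only":      {"data": "OCR transcription", "length_dist": "balanced"},
    "target_skewed": {"data": "long-document VQA", "length_dist": "mostly 128K"},
}

def attribution_holds(results: dict) -> bool:
    """The causal story survives if the reported arm beats both controls on
    long-document VQA at 128K and keeps its margin at 512K evaluation length.
    `results` maps arm name -> {"vqa_128k": float, "vqa_512k": float}."""
    r = results["reported"]
    return all(r["vqa_128k"] > results[arm]["vqa_128k"] and
               r["vqa_512k"] > results[arm]["vqa_512k"]
               for arm in ("ocr_only", "target_skewed"))
```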
Original abstract
Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a systematic empirical study of long-context continued pre-training for 7B vision-language models, extending Qwen2.5-VL-7B from 32K to 128K context using a 5B-token budget. It finds that long-document VQA data outperforms OCR, that balanced length distributions outperform target-length-focused data, and that retrieval-heavy mixtures with modest reasoning data are effective. The resulting MMProLong model reports a 7.1% gain on long-document VQA, preserves short-context performance, and generalizes to 256K/512K contexts plus other tasks (webpage needle retrieval, vision-text compression, long-video understanding) without further training.
Significance. If the central empirical claims hold after clarification of inference details, the work supplies a practical, low-budget LongPT recipe for LVLMs and concrete evidence that data-mixture design (balanced lengths, retrieval emphasis) can drive both in-window gains and zero-shot extrapolation. This is a useful contribution to the still-nascent literature on long-context multimodal training.
major comments (1)
- [Abstract and Methods] The headline generalization claim (strong performance at 256K and 512K beyond the 128K training window) is load-bearing, yet the manuscript does not state which inference-time RoPE scaling method (linear, NTK, YaRN, etc.) was applied. Because extrapolation beyond training length almost always requires such scaling, the causal attribution to the VQA mixture and balanced distribution cannot be isolated without an explicit ablation or disclosure of the inference configuration.
minor comments (1)
- [Ablations] Ablations are summarized at a high level; reporting full per-run statistics, error bars, and confirmation that post-hoc data-selection decisions did not inflate the 7.1% figure would strengthen the empirical claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and will incorporate the requested clarification into the revised version.
Point-by-point responses
Referee: [Abstract and Methods] The headline generalization claim (strong performance at 256K and 512K beyond the 128K training window) is load-bearing, yet the manuscript does not state which inference-time RoPE scaling method (linear, NTK, YaRN, etc.) was applied. Because extrapolation beyond training length almost always requires such scaling, the causal attribution to the VQA mixture and balanced distribution cannot be isolated without an explicit ablation or disclosure of the inference configuration.
Authors: We agree that the inference-time RoPE scaling method must be explicitly disclosed, as it is necessary for interpreting the extrapolation results. This detail was omitted from the original Methods section. In our experiments we applied linear RoPE interpolation at inference time for contexts beyond 128K. We will revise the manuscript to state this configuration clearly and to note that all reported 256K/512K results were obtained under the same inference setup used within the 128K training window. While we do not provide an ablation across scaling methods (as our focus is on training-data design), we will add a brief limitations paragraph acknowledging that different scaling choices could modulate the observed generalization and that the gains are reported under the disclosed linear-interpolation protocol.
Revision: yes
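For context on the disclosed inference setup, here is a minimal sketch of linear RoPE position interpolation, assuming a 128K training window and a 512K evaluation length. It shows only the core position rescaling, not the paper's actual implementation or serving stack.

```python
# Minimal sketch of linear RoPE position interpolation: positions beyond the
# training window are compressed back into the trained range by a constant
# factor. Window sizes below are assumptions matching the review's numbers.
import torch

def rope_angles(positions: torch.Tensor, head_dim: int,
                train_window: int = 131_072, eval_window: int = 524_288,
                base: float = 10_000.0) -> torch.Tensor:
    """Rotary angles with positions linearly rescaled so an eval_window-long
    sequence reuses the position range seen during 128K training."""
    scale = train_window / eval_window            # e.g. 0.25 for 512K eval
    scaled_pos = positions.float() * scale        # map [0, 512K) into [0, 128K)
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    return torch.outer(scaled_pos, inv_freq)      # (seq_len, head_dim // 2)
```

In common open-source inference stacks this is typically exposed as a linear RoPE scaling factor in the model configuration rather than implemented by hand.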
Circularity Check
No circularity: empirical ablation study with results grounded in reported experiments
Full rationale
The paper conducts an empirical study of continued pre-training for long-context LVLMs, using ablations on data mixtures (long-document VQA vs OCR, balanced vs target-length distributions) and reporting performance metrics on VQA scores and generalization to 256K/512K contexts. No mathematical derivations, equations, or first-principles results are presented that reduce by construction to fitted inputs or self-referential definitions. Central claims rest on observed experimental outcomes from a 5B-token budget training run on Qwen2.5-VL-7B, with no load-bearing self-citation chains or ansatzes smuggled via prior work. Generalization beyond the 128K window is stated as an empirical finding without being defined into existence. This is a standard non-circular empirical paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: continued pre-training on instruction-formatted long data preserves short-context capabilities.