Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Pith reviewed 2026-05-14 19:12 UTC · model grok-4.3
The pith
Balanced long-document VQA data lets a vision-language model trained to 128K generalize to 512K contexts without further training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Continued pre-training on a balanced mixture of long-document VQA data enables a 7B LVLM to reach 128K context while generalizing to 256K and 512K contexts, delivering a 7.1 percent gain on long-document VQA benchmarks. Ablations establish that VQA outperforms OCR, balanced length sampling beats target-length concentration, retrieval remains the dominant bottleneck, and instruction-formatted long data largely preserves short-context capabilities without requiring short-data mixing.
What carries the argument
The long-context continued pre-training recipe built around a balanced sequence-length distribution of long-document VQA data.
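A minimal sketch of the contrast at stake, not the paper's released code: how a balanced sequence-length distribution differs from a target-length-focused one when packing long-document VQA samples. Bucket edges, sampling weights, and field names are illustrative assumptions.

```python
# Illustrative sketch only: contrast a balanced sequence-length distribution
# with a target-length-focused one when packing long-document VQA samples.
# Bucket edges and weights are assumptions, not the paper's actual values.
import random

BUCKETS = [8_192, 16_384, 32_768, 65_536, 131_072]  # packed-length targets (tokens)

def sample_target_length(strategy: str) -> int:
    """Pick a packed-sequence length for one training example."""
    if strategy == "balanced":
        # Roughly uniform mass across buckets, so key information appears
        # at many lengths and positions -- the setting the ablations favor.
        return random.choice(BUCKETS)
    if strategy == "target_focused":
        # Concentrate most mass at the 128K training window, the weaker
        # alternative in the reported ablation.
        return random.choices(BUCKETS, weights=[1, 1, 1, 2, 20])[0]
    raise ValueError(f"unknown strategy: {strategy}")

def pack_vqa_sequence(doc_pages: list, qa_pairs: list, target_len: int) -> dict:
    """Greedily concatenate document pages up to the token budget, then
    append the VQA turns (hypothetical schema with a 'num_tokens' field)."""
    packed, used = [], 0
    for page in doc_pages:
        if used + page["num_tokens"] > target_len:
            break
        packed.append(page)
        used += page["num_tokens"]
    return {"pages": packed, "qa": qa_pairs, "length": used}
```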
If this is right
- Long-document VQA accuracy rises by 7.1 percent after 128K training.
- Strong performance holds at 256K and 512K evaluation lengths without additional training.
- The model transfers to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision.
- Short-context capabilities remain intact when training uses only instruction-formatted long data.
- Retrieval-heavy mixtures with modest reasoning data support task diversity while minimizing data volume.
Where Pith is reading between the lines
- The same balanced VQA recipe could be tested on larger base models or different vision encoders to check whether the generalization pattern scales.
- Reducing reliance on short-context data mixtures simplifies the overall training pipeline for long-context LVLMs.
- Extending the balanced distribution to even longer target windows (such as 1M tokens) might further push the generalization boundary observed here.
Load-bearing premise
The performance gains and generalization beyond the training window arise primarily from the long-document VQA mixture and its balanced length distribution rather than from the base model, total token count, or other unablated factors.
What would settle it
Training an identical base model on the same 5B-token budget but with only OCR data or with length distributions heavily skewed toward 128K, then measuring whether long-document VQA scores and 512K generalization match or fall short of the reported results.
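As a hypothetical sketch of how that settling experiment could be organized: the three arms below share the base model, training window, and 5B-token budget and vary only the data axis under test. The configuration fields and metric names are illustrative, not the paper's schema.

```python
# Hypothetical layout of the settling ablation: identical base model and
# token budget, varying only the data mixture or length distribution.
SHARED = {"base_model": "Qwen2.5-VL-7B", "token_budget": 5_000_000_000,
          "train_window": 131_072}

ABLATION_ARMS = {
    "reported":      {"data": "long-document VQA", "length_dist": "balanced"},
    "ocr_only":      {"data": "OCR transcription", "length_dist": "balanced"},
    "target_skewed": {"data": "long-document VQA", "length_dist": "mostly 128K"},
}

def attribution_holds(results: dict) -> bool:
    """The causal story survives if the reported arm beats both controls on
    long-document VQA at 128K and keeps its margin at 512K evaluation length.
    `results` maps arm name -> {"vqa_128k": float, "vqa_512k": float}."""
    r = results["reported"]
    return all(r["vqa_128k"] > results[arm]["vqa_128k"] and
               r["vqa_512k"] > results[arm]["vqa_512k"]
               for arm in ("ocr_only", "target_skewed"))
```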
Original abstract
Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a systematic empirical study of long-context continued pre-training for 7B vision-language models, extending Qwen2.5-VL-7B from 32K to 128K context using a 5B-token budget. It finds that long-document VQA data outperforms OCR, that balanced length distributions outperform target-length-focused data, and that retrieval-heavy mixtures with modest reasoning data are effective. The resulting MMProLong model reports a 7.1% gain on long-document VQA, preserves short-context performance, and generalizes to 256K/512K contexts plus other tasks (webpage needle retrieval, vision-text compression, long-video understanding) without further training.
Significance. If the central empirical claims hold after clarification of inference details, the work supplies a practical, low-budget LongPT recipe for LVLMs and concrete evidence that data-mixture design (balanced lengths, retrieval emphasis) can drive both in-window gains and zero-shot extrapolation. This is a useful contribution to the still-nascent literature on long-context multimodal training.
major comments (1)
- [Abstract and Methods] The headline generalization claim (strong performance at 256K and 512K beyond the 128K training window) is load-bearing, yet the manuscript does not state which inference-time RoPE scaling method (linear, NTK, YaRN, etc.) was applied. Because extrapolation beyond training length almost always requires such scaling, the causal attribution to the VQA mixture and balanced distribution cannot be isolated without an explicit ablation or disclosure of the inference configuration.
minor comments (1)
- [Ablations] Ablations are summarized at a high level; reporting full per-run statistics, error bars, and confirmation that post-hoc data-selection decisions did not inflate the 7.1% figure would strengthen the empirical claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and will incorporate the requested clarification into the revised version.
Point-by-point responses
Referee: [Abstract and Methods] The headline generalization claim (strong performance at 256K and 512K beyond the 128K training window) is load-bearing, yet the manuscript does not state which inference-time RoPE scaling method (linear, NTK, YaRN, etc.) was applied. Because extrapolation beyond training length almost always requires such scaling, the causal attribution to the VQA mixture and balanced distribution cannot be isolated without an explicit ablation or disclosure of the inference configuration.
Authors: We agree that the inference-time RoPE scaling method must be explicitly disclosed, as it is necessary for interpreting the extrapolation results. This detail was omitted from the original Methods section. In our experiments we applied linear RoPE interpolation at inference time for contexts beyond 128K. We will revise the manuscript to state this configuration clearly and to note that all reported 256K/512K results were obtained under the same inference setup used within the 128K training window. While we do not provide an ablation across scaling methods (as our focus is on training-data design), we will add a brief limitations paragraph acknowledging that different scaling choices could modulate the observed generalization and that the gains are reported under the disclosed linear-interpolation protocol.
Revision: yes
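For context on the disclosed inference setup, here is a minimal sketch of linear RoPE position interpolation, assuming a 128K training window and a 512K evaluation length. It shows only the core position rescaling, not the paper's actual implementation or serving stack.

```python
# Minimal sketch of linear RoPE position interpolation: positions beyond the
# training window are compressed back into the trained range by a constant
# factor. Window sizes below are assumptions matching the review's numbers.
import torch

def rope_angles(positions: torch.Tensor, head_dim: int,
                train_window: int = 131_072, eval_window: int = 524_288,
                base: float = 10_000.0) -> torch.Tensor:
    """Rotary angles with positions linearly rescaled so an eval_window-long
    sequence reuses the position range seen during 128K training."""
    scale = train_window / eval_window            # e.g. 0.25 for 512K eval
    scaled_pos = positions.float() * scale        # map [0, 512K) into [0, 128K)
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    return torch.outer(scaled_pos, inv_freq)      # (seq_len, head_dim // 2)
```

In common open-source inference stacks this is typically exposed as a linear RoPE scaling factor in the model configuration rather than implemented by hand.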
Circularity Check
No circularity: empirical ablation study with results grounded in reported experiments
Full rationale
The paper conducts an empirical study of continued pre-training for long-context LVLMs, using ablations on data mixtures (long-document VQA vs OCR, balanced vs target-length distributions) and reporting performance metrics on VQA scores and generalization to 256K/512K contexts. No mathematical derivations, equations, or first-principles results are presented that reduce by construction to fitted inputs or self-referential definitions. Central claims rest on observed experimental outcomes from a 5B-token budget training run on Qwen2.5-VL-7B, with no load-bearing self-citation chains or ansatzes smuggled via prior work. Generalization beyond the 128K window is stated as an empirical finding without being defined into existence. This is a standard non-circular empirical paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: continued pre-training on instruction-formatted long data preserves short-context capabilities.