pith. machine review for the scientific record.

arxiv: 2604.02371 · v1 · submitted 2026-03-31 · 💻 cs.CV · cs.AI · cs.CL

Recognition: no theorem link

Internalized Reasoning for Long-Context Visual Document Understanding

Austin Veselka

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:49 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords internalized reasoning · visual document understanding · synthetic data pipeline · long-context vision-language models · model merging · supervised fine-tuning · MMLongBenchDoc · chain-of-thought

The pith

Synthetic reasoning traces internalized via model merging let smaller vision-language models outperform much larger ones on long visual documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to create synthetic thinking traces for visual long-document questions by scoring page relevance and ordering evidence. These traces are used for supervised fine-tuning inside special tags, then the reasoning is internalized through low-strength merging with the base model. This approach enables a 32 billion parameter model to exceed the performance of a 235 billion parameter model on a key benchmark while producing far fewer output tokens. The method addresses the lack of reasoning exploration in previous open recipes for document understanding, which is important for applications like legal review and scientific analysis where long contexts are common.

Core claim

Generating synthetic reasoning traces through page relevance scoring, evidence extraction, and ordering, applying SFT gated by a control token, and internalizing the result via low-strength merging makes the reasoning capability part of the model's parameters, allowing high performance on visual long-document tasks without explicit chain-of-thought at inference time.

What carries the argument

The synthetic data pipeline for generating thinking traces by scoring each page for question relevance, extracting textual evidence, and ordering it from most to least relevant, combined with SFT in <think> tags and low-strength model merging.
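A minimal sketch of that trace-generation step as described above, assuming a teacher scorer and extractor are available; the names score_relevance and extract_evidence, the 0–1 score range, and the relevance threshold are illustrative assumptions, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class PageEvidence:
    page_index: int
    relevance: float  # e.g. a 0-1 question-relevance score from a teacher model
    evidence: str     # textual evidence extracted from the page


def build_thinking_trace(pages, question, score_relevance, extract_evidence,
                         min_relevance=0.5):
    """Assemble one synthetic reasoning trace for a (document, question) pair."""
    kept = []
    for idx, page in enumerate(pages):
        rel = score_relevance(page, question)  # score each page for relevance
        if rel >= min_relevance:
            kept.append(PageEvidence(idx, rel, extract_evidence(page, question)))

    # Order evidence from most to least relevant, as the pipeline description states.
    kept.sort(key=lambda p: p.relevance, reverse=True)

    body = "\n".join(f"[page {p.page_index}] (relevance {p.relevance:.2f}) {p.evidence}"
                     for p in kept)
    return f"<think>\n{body}\n</think>"
```

Ordering the surviving evidence from most to least relevant is the behavior the traces are meant to internalize; everything else in the sketch is plumbing.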

Load-bearing premise

The synthetic traces generated by page relevance scoring and evidence ordering are of sufficient quality to teach genuine internalized reasoning rather than just superficial patterns.

What would settle it

An evaluation on a held-out long-document benchmark in which the fine-tuned smaller model fails to exceed the larger model's score, or shows no reduction in output tokens, would indicate the claim does not hold.

Figures

Figures reproduced from arXiv: 2604.02371 by Austin Veselka.

Figure 1. Our proposed synthetic reasoning pipeline. For a given document and question, we extract …
Figure 2. Output length distributions for Qwen and Mistral.
Figure 3. An example from the v1 dataset.
Figure 4. An example from the v2 dataset.
Original abstract

Visual long-document understanding is critical for enterprise, legal, and scientific applications, yet the best performing open recipes have not explored reasoning, a capability which has driven leaps in math and code performance. We introduce a synthetic data pipeline for reasoning in long-document understanding that generates thinking traces by scoring each page for question relevance, extracting textual evidence and ordering it from most to least relevant. We apply SFT to the resulting traces within <think> tags, gated by a <cot> control token, and the resulting reasoning capability is internalized via low-strength model merging. We study Qwen3 VL 32B and Mistral Small 3.1 24B. With Qwen3 VL, we achieve 58.3 on MMLongBenchDoc, surpassing the 7× larger Qwen3 VL 235B A22B (57.0). With Mistral, we show that synthetic reasoning outperforms distillation from the Thinking version's traces by 3.8 points on MMLBD-C, and internalized reasoning exhibits 12.4× fewer mean output tokens compared to explicit reasoning. We release our pipeline for reproducibility and further exploration.
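A minimal sketch of what one <cot>-gated SFT sample could look like under the abstract's description; the chat-message schema and the placement of the control token in the user turn are assumptions, since the page does not specify the template:

```python
def make_sft_example(page_images, question, trace_text, answer, with_reasoning=True):
    """Wrap one sample in a <cot>-gated format with the trace inside <think> tags.

    trace_text is the ordered-evidence reasoning trace; when with_reasoning is
    False, the same sample is emitted without the control token or <think> block.
    """
    user_content = [{"type": "image", "image": img} for img in page_images]
    user_content.append({"type": "text",
                         "text": ("<cot> " if with_reasoning else "") + question})
    assistant_text = (f"<think>\n{trace_text}\n</think>\n{answer}"
                      if with_reasoning else answer)
    return {"messages": [{"role": "user", "content": user_content},
                         {"role": "assistant", "content": assistant_text}]}
```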

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces a synthetic data pipeline that generates reasoning traces for long-context visual document understanding by scoring each page for question relevance, extracting textual evidence, and ordering it from most to least relevant. These traces are used for SFT inside <think> tags gated by a <cot> control token, after which the reasoning capability is internalized through low-strength model merging. Experiments with Qwen3 VL 32B report 58.3 on MMLongBenchDoc (surpassing the 7× larger Qwen3 VL 235B), while Mistral Small 3.1 24B shows synthetic reasoning outperforming distillation by 3.8 points on MMLBD-C and 12.4× fewer mean output tokens than explicit reasoning. The pipeline is released for reproducibility.

Significance. If the results hold, the work offers a scalable route to add explicit reasoning to vision-language models for long documents while preserving inference efficiency, which is valuable for enterprise, legal, and scientific document tasks. The public release of the pipeline is a concrete strength that enables direct verification and extension.

major comments (1)
  1. [Synthetic data pipeline] Synthetic data pipeline (abstract and §3): the performance gains (58.3 on MMLongBenchDoc, +3.8 over distillation, 12.4× token reduction) rest on the unverified assumption that relevance-scored and ordered traces contain the causal structure needed for genuine internalization. No human validation, inter-annotator agreement, or ablation (e.g., scored ordering vs. random ordering) is reported to rule out the possibility that SFT and merging simply amplify benchmark-specific heuristics.
minor comments (1)
  1. [Methods] The description of the <cot> gating mechanism and the low-strength merging hyperparameter would benefit from the explicit values or ranges used in the reported runs.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below and indicate the revisions we will incorporate.

Point-by-point responses
  1. Referee: Synthetic data pipeline (abstract and §3): the performance gains (58.3 on MMLongBenchDoc, +3.8 over distillation, 12.4× token reduction) rest on the unverified assumption that relevance-scored and ordered traces contain the causal structure needed for genuine internalization. No human validation, inter-annotator agreement, or ablation (e.g., scored ordering vs. random ordering) is reported to rule out the possibility that SFT and merging simply amplify benchmark-specific heuristics.

    Authors: We agree that the current manuscript does not report human validation, inter-annotator agreement, or ablations on ordering. The pipeline was designed to simulate structured reasoning by prioritizing relevant evidence pages, which we hypothesized would support internalization; this is supported by the observed gains over distillation and the large reduction in output tokens. To address the concern directly, we will add an ablation comparing relevance-scored ordering against random ordering, along with expanded discussion of the pipeline design in Section 3 of the revised manuscript. revision: yes
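For concreteness, a minimal sketch of the ordering ablation promised above, in which otherwise-identical training sets differ only in how evidence is ordered before the traces are built; this experiment is not yet reported, and the helper below is purely illustrative:

```python
import random

def order_evidence(evidence_items, mode, seed=0):
    """evidence_items: list of (relevance_score, evidence_text) pairs."""
    if mode == "scored":
        # Most-to-least relevant, as in the paper's pipeline.
        return sorted(evidence_items, key=lambda item: item[0], reverse=True)
    if mode == "random":
        # Control condition requested by the referee: same evidence, shuffled order.
        rng = random.Random(seed)
        shuffled = list(evidence_items)
        rng.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown ordering mode: {mode!r}")
```

Holding the SFT and merging settings fixed across the two conditions would isolate the contribution of relevance-scored ordering.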

Circularity Check

0 steps flagged

No circularity: empirical benchmark results from synthetic data pipeline

Full rationale

The paper introduces a synthetic data pipeline that scores pages for relevance, extracts evidence, orders it, applies SFT within <think> tags gated by <cot>, and internalizes via low-strength merging. It then reports direct empirical measurements such as 58.3 on MMLongBenchDoc and 3.8-point gains over distillation. No equations, derivations, or self-referential definitions exist that reduce these scores to fitted parameters or prior outputs by construction. The results are measured outcomes on external benchmarks rather than tautological predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the assumption that synthetic traces produced by relevance scoring can effectively teach reasoning, plus standard assumptions about SFT and model merging; the merging strength is a free hyperparameter tuned empirically.

free parameters (1)
  • merging strength
    Low-strength coefficient for model merging is chosen to add reasoning capability without degrading base model performance; the value is not stated and must be selected on validation data (a minimal interpolation sketch follows this ledger).
axioms (2)
  • domain assumption Supervised fine-tuning on synthetic reasoning traces produces internalized reasoning capability
    Core premise that the generated traces transfer useful reasoning behavior into the model weights.
  • domain assumption Low-strength model merging can combine capabilities without catastrophic interference
    Standard assumption drawn from prior model merging work.
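The ledger's free parameter can be made concrete with a minimal interpolation sketch, under the assumption that "low-strength merging" is a simple weighted average of the base and SFT checkpoints; the coefficient 0.1 is illustrative, as the page does not report the value used:

```python
def merge_low_strength(base_state_dict, sft_state_dict, alpha=0.1):
    """Interpolate tensor-like weights: merged = base + alpha * (sft - base)."""
    merged = {}
    for name, base_w in base_state_dict.items():
        merged[name] = base_w + alpha * (sft_state_dict[name] - base_w)
    return merged
```

A small alpha keeps the merged weights close to the base model while importing the reasoning behavior, which is the stated reason for choosing the coefficient on validation data.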

pith-pipeline@v0.9.0 · 5503 in / 1476 out tokens · 61711 ms · 2026-05-13T23:49:26.403704+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 16 internal anchors

  1. [1]

    SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Wer...

  2. [2]

    Temporal chain of thought: Long-video understanding by thinking in frames, 2025

    Anurag Arnab, Ahmet Iscen, Mathilde Caron, Alireza Fathi, and Cordelia Schmid. Temporal chain of thought: Long-video understanding by thinking in frames, 2025. URL https://arxiv.org/abs/2507.02001

  3. [3]

    Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks, 2025

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks, 2025. URL https://arxiv.org/abs/2412.15204

  4. [4]

    Reasoning theater: Disentangling model beliefs from chain-of-thought, 2026

    Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, and Jack Merullo. Reasoning theater: Disentangling model beliefs from chain-of-thought, 2026. URL https://arxiv.org/abs/2603.05488

  5. [5]

    Longpo: Long context self-evolution of large language models through short-to-long preference optimization, 2025

    Guanzheng Chen, Xin Li, Michael Qizhe Shieh, and Lidong Bing. Longpo: Long context self-evolution of large language models through short-to-long preference optimization, 2025. URLhttps://arxiv.org/abs/2502.13922

  6. [6]

    Distilling reasoning ability from large language models with adaptive thinking

    Xiaoshu Chen, Sihang Zhou, Ke Liang, and Xinwang Liu. Distilling reasoning ability from large language models with adaptive thinking, 2025. URL https://arxiv.org/abs/2404.09170

  7. [7]

    Longvila: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188, 2024

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, and Song Han. Longvila: Scaling long-context visual language models for long videos, 2024. URL https://arxiv.org/abs/2408.10188

  8. [8]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024. URL https://arxiv.org/abs/2403.04132

  9. [9]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

  11. [11]

    Implicit chain of thought reasoning via knowledge distillation, 2023

    Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. Implicit chain of thought reasoning via knowledge distillation, 2023. URL https://arxiv.org/abs/2311.01460

  12. [12]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024

  13. [13]

    Docopilot: Improving multimodal models for document-level understanding, 2025

    Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, Jifeng Dai, and Wenhai Wang. Docopilot: Improving multimodal models for document-level understanding, 2025. URL https://arxiv.org/abs/2507.14675

  14. [14]

    Nextlong: Toward effective long-context training without long documents, 2025

    Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, and Songlin Hu. Nextlong: Toward effective long-context training without long documents, 2025. URL https://arxiv.org/abs/2501.12766

  15. [15]

    How to train long-context language models (effectively). arXiv preprint arXiv:2410.02660, 2024

    Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively), 2025. URL https://arxiv.org/abs/2410.02660

  16. [16]

    V2pe: Improving multimodal long-context capability of vision-language models with variable visual position encoding, 2024

    Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, and Xizhou Zhu. V2pe: Improving multimodal long-context capability of vision-language models with variable visual position encoding, 2024. URLhttps://arxiv.org/abs/2412.09616

  17. [17]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Google. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv.org/abs/2507.06261

  18. [18]

    Context rot: How increasing input tokens impacts llm performance

    Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts llm performance. Technical report, Chroma, July 2025. URL https://research.trychroma.com/context-rot

  19. [19]

    Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes, 2023

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes, 2023. URL https://arxiv.org/abs/2305.02301

  20. [20]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. Minicpm: Unveiling the potential of small langu...

  21. [21]

    Editing Models with Task Arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic, 2023. URL https://arxiv.org/abs/2212.04089

  22. [22]

    Tablevqa-bench: A visual question answering benchmark on multiple table domains, 2024

    Yoonsik Kim, Moonbin Yim, and Ka Yeon Song. Tablevqa-bench: A visual question answering benchmark on multiple table domains, 2024. URLhttps://arxiv.org/abs/2404.19205

  23. [23]

    Document understanding dataset and evaluation (dude), 2023

    Jordy Van Landeghem, Rubén Tito, Łukasz Borchmann, Michał Pietruszka, Paweł Józiak, Rafał Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Ackaert, Ernest Valveny, Matthew Blaschko, Sien Moens, and Tomasz Stanisławek. Document understanding dataset and evaluation (dude), 2023. URL https://arxiv.org/abs/2305.08455

  24. [24]

    Luth: Efficient french specialization for small language models and cross-lingual transfer

    Maxence Lasbordes and Sinoué Gad. Luth: Efficient french specialization for small language models and cross-lingual transfer. https://arxiv.org/abs/2510.05846, 2025. arXiv:2510.05846

  25. [25]

    Wildlong: Synthesizing realistic long-context instruction data at scale, 2025

    Jiaxi Li, Xingxing Zhang, Xun Wang, Xiaolong Huang, Li Dong, Liang Wang, Si-Qing Chen, Wei Lu, and Furu Wei. Wildlong: Synthesizing realistic long-context instruction data at scale, 2025. URL https://arxiv.org/abs/2502.16684

  27. [27]

    Chain of thought empowers transformers to solve inherently serial problems, 2024

    Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems, 2024. URLhttps://arxiv.org/abs/2402.12875

  28. [28]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023. URLhttps://arxiv.org/abs/2310.01889

  29. [29]

    Deep Thinking by Markov Chain of Continuous Thoughts

    Jiayu Liu, Zhenya Huang, Anya Sims, Enhong Chen, Yee Whye Teh, and Ning Miao. MARCOS: Deep thinking by markov chain of continuous thoughts, 2025. URL https://arxiv.org/abs/2509.25020

  30. [30]

    Lost in the middle: How language models use long contexts, 2023

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023. URL https://arxiv.org/abs/2307.03172

  31. [31]

    Bolt: Boost large vision-language model without training for long-form video understanding, 2025

    Shuming Liu, Chen Zhao, Tianqi Xu, and Bernard Ghanem. Bolt: Boost large vision-language model without training for long-form video understanding, 2025. URL https://arxiv.org/abs/2503.21483

  32. [32]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101

  33. [33]

    Mmlongbench-doc: Benchmarking long-context document understanding with visualizations, 2024

    Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, and Aixin Sun. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations, 2024. URLhttps://arxiv.org/abs/2407.01523

  34. [34]

    The expressive power of transformers with chain of thought, 2024

    William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought, 2024. URLhttps://arxiv.org/abs/2310.07923

  35. [35]

    Mistral small 3.1, 2025

    MistralAI. Mistral small 3.1, 2025

  36. [36]

    pdfa-eng-wds, 2024

    Pablo Montalvo and Ross Wightman. pdfa-eng-wds, 2024. URL https://huggingface.co/datasets/pixparse/pdfa-eng-wds. Accessed 2026-01-23

  37. [37]

    GPT-4o System Card

    OpenAI. Gpt-4o system card, 2024. URLhttps://arxiv.org/abs/2410.21276

  38. [38]

    OpenAI o1 System Card

    OpenAI. Openai o1 system card, 2024. URLhttps://arxiv.org/abs/2412.16720

  39. [39]

    Introducing gpt-5.3-codex, February 2026

    OpenAI. Introducing gpt-5.3-codex, February 2026. URL https://openai.com/index/introducing-gpt-5-3-codex/. Accessed: 2026-02-28

  40. [40]

    Long-vita: Scaling large multi-modal models to 1 million tokens with leading short-context accuracy, 2025

    Yunhang Shen, Chaoyou Fu, Shaoqi Dong, Xiong Wang, Yi-Fan Zhang, Peixian Chen, Mengdan Zhang, Haoyu Cao, Ke Li, Shaohui Lin, Xiawu Zheng, Yan Zhang, Yiyi Zhou, Ran He, Caifeng Shan, Rongrong Ji, and Xing Sun. Long-vita: Scaling large multi-modal models to 1 million tokens with leading short-context accuracy, 2025. URL https://arxiv.org/abs/2502.05177

  41. [41]

    Solopo: Unlocking long-context capabilities in llms via short-to-long preference optimization, 2025

    Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, and Fei Huang. Solopo: Unlocking long-context capabilities in llms via short-to-long preference optimization, 2025. URL https://arxiv.org/abs/2505.11166

  42. [42]

    Slidevqa: A dataset for document visual question answering on multiple images, 2023

    Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. Slidevqa: A dataset for document visual question answering on multiple images, 2023. URLhttps://arxiv.org/abs/2301.04883

  43. [43]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. URLhttps://arxiv.org/abs/2505.09388

  44. [44]

    Qwen3-VL Technical Report

    Qwen Team. Qwen3-vl technical report, 2025. URL https://arxiv.org/abs/2511.21631

  45. [45]

    How to train your long-context visual document model. arXiv preprint arXiv:2602.15257, 2026

    Austin Veselka. How to train your long-context visual document model, 2026. URL https://arxiv.org/abs/2602.15257

  46. [46]

    Investigating mysteries of cot-augmented distillation, 2024

    Somin Wadhwa, Silvio Amir, and Byron C. Wallace. Investigating mysteries of cot-augmented distillation, 2024. URL https://arxiv.org/abs/2406.14511

  47. [47]

    Qwenlong-l1: Towards long-context large reasoning models with reinforcement learning, 2025

    Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, and Ming Yan. Qwenlong-l1: Towards long-context large reasoning models with reinforcement learning, 2025. URLhttps://arxiv.org/abs/2505.17667

  48. [48]

    Bootstrap your own context length, 2025

    Liang Wang, Nan Yang, Xingxing Zhang, Xiaolong Huang, and Furu Wei. Bootstrap your own context length, 2025. URLhttps://arxiv.org/abs/2412.18860

  49. [49]

    Mmlongbench: Benchmarking long-context vision-language models effectively and thoroughly. arXiv preprint arXiv:2505.10610, 2025

    Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, and Mark Steedman. Mmlongbench: Benchmarking long-context vision-language models effectively and thoroughly, 2025. URL https://arxiv.org/abs/2505.10610

  50. [50]

    Self-preference bias in LLM-as-a-judge

    Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in LLM-as-a-judge. URL https://openreview.net/forum?id=Ns8zGZ0lmM

  52. [52]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

  53. [53]

    Thinking like transformers, 2021

    Gail Weiss, Yoav Goldberg, and Eran Yahav. Thinking like transformers, 2021. URL https://arxiv.org/abs/2106.06981

  54. [54]

    Dual-head reasoning distillation: Improving classifier accuracy with train-time-only reasoning, 2025

    Jillian Xu, Dylan Zhou, Vinay Shukla, Yang Yang, Junrui Ruan, Shuhuai Lin, Wenfei Zou, Yinxiao Liu, and Karthik Lakshmanan. Dual-head reasoning distillation: Improving classifier accuracy with train-time-only reasoning, 2025. URL https://arxiv.org/abs/2509.21487

  55. [55]

    Longfaith: Enhancing long-context reasoning in llms with faithful synthetic data, 2025

    Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, and Jian Guo. Longfaith: Enhancing long-context reasoning in llms with faithful synthetic data, 2025. URL https://arxiv.org/abs/2502.12583

  57. [57]

    Helmet: How to evaluate long-context language models effectively and thoroughly, 2025

    Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, and Danqi Chen. Helmet: How to evaluate long-context language models effectively and thoroughly, 2025. URLhttps://arxiv.org/abs/2410.02694

  58. [58]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Z.ai. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2026. URLhttps://arxiv.org/abs/2507.01006

  59. [59]

    Quiet-STaR: Language models can teach themselves to think before speaking, 2024

    Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D. Goodman. Quiet-STaR: Language models can teach themselves to think before speaking, 2024. URL https://arxiv.org/abs/2403.09629

  60. [60]

    Improve vision language model chain-of-thought reasoning, 2024

    Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning, 2024. URLhttps://arxiv.org/abs/2410.16198

  61. [61]

    Chain-of-thought tokens are computer program variables, 2025

    Fangwei Zhu, Peiyi Wang, and Zhifang Sui. Chain-of-thought tokens are computer program variables, 2025. URLhttps://arxiv.org/abs/2505.04955

  62. [62]

    ring-flash-attn, 2024

    Zilin Zhu. ring-flash-attn, 2024. URL https://github.com/zhuzilin/ring-flash-attention
