pith. machine review for the scientific record.

arxiv: 2604.02371 · v1 · submitted 2026-03-31 · 💻 cs.CV · cs.AI · cs.CL

Recognition: no theorem link

Internalized Reasoning for Long-Context Visual Document Understanding

Austin Veselka

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:49 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords internalized reasoning · visual document understanding · synthetic data pipeline · long-context vision-language models · model merging · supervised fine-tuning · MMLongBenchDoc · chain-of-thought

The pith

Synthetic reasoning traces internalized via model merging let smaller vision-language models outperform much larger ones on long visual documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to create synthetic thinking traces for visual long-document questions by scoring page relevance and ordering evidence. These traces are used for supervised fine-tuning inside special tags, then the reasoning is internalized through low-strength merging with the base model. This approach enables a 32 billion parameter model to exceed the performance of a 235 billion parameter model on a key benchmark while producing far fewer output tokens. The method addresses the lack of reasoning exploration in previous open recipes for document understanding, which is important for applications like legal review and scientific analysis where long contexts are common.

Core claim

Generating synthetic reasoning traces through page relevance scoring, evidence extraction, and ordering, applying SFT gated by a control token, and internalizing the result via low-strength merging makes the reasoning capability part of the model's parameters, allowing high performance on visual long-document tasks without explicit chain-of-thought at inference time.

What carries the argument

The synthetic data pipeline for generating thinking traces by scoring each page for question relevance, extracting textual evidence, and ordering it from most to least relevant, combined with SFT in <think> tags and low-strength model merging.
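A minimal sketch of that trace-generation step as described above, assuming a teacher scorer and extractor are available; the names score_relevance and extract_evidence, the 0–1 score range, and the relevance threshold are illustrative assumptions, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class PageEvidence:
    page_index: int
    relevance: float  # e.g. a 0-1 question-relevance score from a teacher model
    evidence: str     # textual evidence extracted from the page


def build_thinking_trace(pages, question, score_relevance, extract_evidence,
                         min_relevance=0.5):
    """Assemble one synthetic reasoning trace for a (document, question) pair."""
    kept = []
    for idx, page in enumerate(pages):
        rel = score_relevance(page, question)  # score each page for relevance
        if rel >= min_relevance:
            kept.append(PageEvidence(idx, rel, extract_evidence(page, question)))

    # Order evidence from most to least relevant, as the pipeline description states.
    kept.sort(key=lambda p: p.relevance, reverse=True)

    body = "\n".join(f"[page {p.page_index}] (relevance {p.relevance:.2f}) {p.evidence}"
                     for p in kept)
    return f"<think>\n{body}\n</think>"
```

Ordering the surviving evidence from most to least relevant is the behavior the traces are meant to internalize; everything else in the sketch is plumbing.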

Load-bearing premise

The synthetic traces generated by page relevance scoring and evidence ordering are of sufficient quality to teach genuine internalized reasoning rather than just superficial patterns.

What would settle it

An evaluation on a held-out long-document benchmark in which the fine-tuned smaller model fails to exceed the larger model's score, or shows no reduction in output tokens, would indicate the claim does not hold.

Figures

Figures reproduced from arXiv: 2604.02371 by Austin Veselka.

Figure 1. Our proposed synthetic reasoning pipeline. For a given document and question, we extract …
Figure 2. Output length distributions for Qwen and Mistral.
Figure 3. An example from the v1 dataset.
Figure 4. An example from the v2 dataset.
Original abstract

Visual long-document understanding is critical for enterprise, legal, and scientific applications, yet the best performing open recipes have not explored reasoning, a capability which has driven leaps in math and code performance. We introduce a synthetic data pipeline for reasoning in long-document understanding that generates thinking traces by scoring each page for question relevance, extracting textual evidence and ordering it from most to least relevant. We apply SFT to the resulting traces within <think> tags, gated by a <cot> control token, and the resulting reasoning capability is internalized via low-strength model merging. We study Qwen3 VL 32B and Mistral Small 3.1 24B. With Qwen3 VL, we achieve 58.3 on MMLongBenchDoc, surpassing the 7× larger Qwen3 VL 235B A22B (57.0). With Mistral, we show that synthetic reasoning outperforms distillation from the Thinking version's traces by 3.8 points on MMLBD-C, and internalized reasoning exhibits 12.4× fewer mean output tokens compared to explicit reasoning. We release our pipeline for reproducibility and further exploration.
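A minimal sketch of what one <cot>-gated SFT sample could look like under the abstract's description; the chat-message schema and the placement of the control token in the user turn are assumptions, since the page does not specify the template:

```python
def make_sft_example(page_images, question, trace_text, answer, with_reasoning=True):
    """Wrap one sample in a <cot>-gated format with the trace inside <think> tags.

    trace_text is the ordered-evidence reasoning trace; when with_reasoning is
    False, the same sample is emitted without the control token or <think> block.
    """
    user_content = [{"type": "image", "image": img} for img in page_images]
    user_content.append({"type": "text",
                         "text": ("<cot> " if with_reasoning else "") + question})
    assistant_text = (f"<think>\n{trace_text}\n</think>\n{answer}"
                      if with_reasoning else answer)
    return {"messages": [{"role": "user", "content": user_content},
                         {"role": "assistant", "content": assistant_text}]}
```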

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces a synthetic data pipeline that generates reasoning traces for long-context visual document understanding by scoring each page for question relevance, extracting textual evidence, and ordering it from most to least relevant. These traces are used for SFT inside <think> tags gated by a <cot> control token, after which the reasoning capability is internalized through low-strength model merging. Experiments with Qwen3 VL 32B report 58.3 on MMLongBenchDoc (surpassing the 7× larger Qwen3 VL 235B), while Mistral Small 3.1 24B shows synthetic reasoning outperforming distillation by 3.8 points on MMLBD-C and 12.4× fewer mean output tokens than explicit reasoning. The pipeline is released for reproducibility.

Significance. If the results hold, the work offers a scalable route to add explicit reasoning to vision-language models for long documents while preserving inference efficiency, which is valuable for enterprise, legal, and scientific document tasks. The public release of the pipeline is a concrete strength that enables direct verification and extension.

major comments (1)
  1. [Synthetic data pipeline] Synthetic data pipeline (abstract and §3): the performance gains (58.3 on MMLongBenchDoc, +3.8 over distillation, 12.4× token reduction) rest on the unverified assumption that relevance-scored and ordered traces contain the causal structure needed for genuine internalization. No human validation, inter-annotator agreement, or ablation (e.g., scored ordering vs. random ordering) is reported to rule out the possibility that SFT and merging simply amplify benchmark-specific heuristics.
minor comments (1)
  1. [Methods] The description of the <cot> gating mechanism and the low-strength merging hyperparameter would benefit from the explicit values or ranges used in the reported runs.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below and indicate the revisions we will incorporate.

Point-by-point responses
  1. Referee: Synthetic data pipeline (abstract and §3): the performance gains (58.3 on MMLongBenchDoc, +3.8 over distillation, 12.4× token reduction) rest on the unverified assumption that relevance-scored and ordered traces contain the causal structure needed for genuine internalization. No human validation, inter-annotator agreement, or ablation (e.g., scored ordering vs. random ordering) is reported to rule out the possibility that SFT and merging simply amplify benchmark-specific heuristics.

    Authors: We agree that the current manuscript does not report human validation, inter-annotator agreement, or ablations on ordering. The pipeline was designed to simulate structured reasoning by prioritizing relevant evidence pages, which we hypothesized would support internalization; this is supported by the observed gains over distillation and the large reduction in output tokens. To address the concern directly, we will add an ablation comparing relevance-scored ordering against random ordering, along with expanded discussion of the pipeline design in Section 3 of the revised manuscript. revision: yes
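For concreteness, a minimal sketch of the ordering ablation promised above, in which otherwise-identical training sets differ only in how evidence is ordered before the traces are built; this experiment is not yet reported, and the helper below is purely illustrative:

```python
import random

def order_evidence(evidence_items, mode, seed=0):
    """evidence_items: list of (relevance_score, evidence_text) pairs."""
    if mode == "scored":
        # Most-to-least relevant, as in the paper's pipeline.
        return sorted(evidence_items, key=lambda item: item[0], reverse=True)
    if mode == "random":
        # Control condition requested by the referee: same evidence, shuffled order.
        rng = random.Random(seed)
        shuffled = list(evidence_items)
        rng.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown ordering mode: {mode!r}")
```

Holding the SFT and merging settings fixed across the two conditions would isolate the contribution of relevance-scored ordering.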

Circularity Check

0 steps flagged

No circularity: empirical benchmark results from synthetic data pipeline

Full rationale

The paper introduces a synthetic data pipeline that scores pages for relevance, extracts evidence, orders it, applies SFT within <think> tags gated by <cot>, and internalizes via low-strength merging. It then reports direct empirical measurements such as 58.3 on MMLongBenchDoc and 3.8-point gains over distillation. No equations, derivations, or self-referential definitions exist that reduce these scores to fitted parameters or prior outputs by construction. The results are measured outcomes on external benchmarks rather than tautological predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the assumption that synthetic traces produced by relevance scoring can effectively teach reasoning, plus standard assumptions about SFT and model merging; the merging strength is a free hyperparameter tuned empirically.

free parameters (1)
  • merging strength
    Low-strength coefficient for model merging is chosen to add reasoning capability without degrading base model performance; the value is not stated and must be selected on validation data (a minimal interpolation sketch follows this ledger).
axioms (2)
  • domain assumption Supervised fine-tuning on synthetic reasoning traces produces internalized reasoning capability
    Core premise that the generated traces transfer useful reasoning behavior into the model weights.
  • domain assumption Low-strength model merging can combine capabilities without catastrophic interference
    Standard assumption drawn from prior model merging work.
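The ledger's free parameter can be made concrete with a minimal interpolation sketch, under the assumption that "low-strength merging" is a simple weighted average of the base and SFT checkpoints; the coefficient 0.1 is illustrative, as the page does not report the value used:

```python
def merge_low_strength(base_state_dict, sft_state_dict, alpha=0.1):
    """Interpolate tensor-like weights: merged = base + alpha * (sft - base)."""
    merged = {}
    for name, base_w in base_state_dict.items():
        merged[name] = base_w + alpha * (sft_state_dict[name] - base_w)
    return merged
```

A small alpha keeps the merged weights close to the base model while importing the reasoning behavior, which is the stated reason for choosing the coefficient on validation data.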

pith-pipeline@v0.9.0 · 5503 in / 1476 out tokens · 61711 ms · 2026-05-13T23:49:26.403704+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 16 internal anchors

  1. [1]

    SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Wer...

  2. [2]

    Temporal chain of thought: Long-video understanding by thinking in frames, 2025

    Anurag Arnab, Ahmet Iscen, Mathilde Caron, Alireza Fathi, and Cordelia Schmid. Temporal chain of thought: Long-video understanding by thinking in frames, 2025. URL https://arxiv.org/abs/2507.02001

  3. [3]

    Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks, 2025

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks, 2025. URL https://arxiv.org/abs/2412.15204

  4. [4]

    Reasoning theater: Disentangling model beliefs from chain-of-thought, 2026

    Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, and Jack Merullo. Reasoning theater: Disentangling model beliefs from chain-of-thought, 2026. URL https://arxiv.org/abs/2603.05488

  5. [5]

    Longpo: Long context self-evolution of large language models through short-to-long preference optimization, 2025

    Guanzheng Chen, Xin Li, Michael Qizhe Shieh, and Lidong Bing. Longpo: Long context self-evolution of large language models through short-to-long preference optimization, 2025. URLhttps://arxiv.org/abs/2502.13922

  6. [6]

    Distilling reasoning ability from large language models with adaptive thinking

    Xiaoshu Chen, Sihang Zhou, Ke Liang, and Xinwang Liu. Distilling reasoning ability from large language models with adaptive thinking, 2025. URL https://arxiv.org/abs/2404.09170

  7. [7]

    Longvila: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188, 2024

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, and Song Han. Longvila: Scaling long-context visual language models for long videos, 2024. URL https://arxiv.org/abs/2408.10188

  8. [8]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024. URL https://arxiv.org/abs/2403.04132

  9. [9]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

  11. [11]

    Implicit chain of thought reasoning via knowledge distillation, 2023

    Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. Implicit chain of thought reasoning via knowledge distillation, 2023. URL https://arxiv.org/abs/2311.01460

  12. [12]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024

  13. [13]

    Docopilot: Improving multimodal models for document-level understanding, 2025

    Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, Jifeng Dai, and Wenhai Wang. Docopilot: Improving multimodal models for document-level understanding, 2025. URL https://arxiv.org/abs/2507.14675

  14. [14]

    Nextlong: Toward effective long-context training without long documents, 2025

    Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, and Songlin Hu. Nextlong: Toward effective long-context training without long documents, 2025. URL https://arxiv.org/abs/2501.12766

  15. [15]

    How to train long-context language models (effectively). arXiv preprint arXiv:2410.02660, 2024

    Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively), 2025. URL https://arxiv.org/abs/2410.02660

  16. [16]

    V2pe: Improving multimodal long-context capability of vision-language models with variable visual position encoding, 2024

    Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, and Xizhou Zhu. V2pe: Improving multimodal long-context capability of vision-language models with variable visual position encoding, 2024. URLhttps://arxiv.org/abs/2412.09616

  17. [17]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Google. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv.org/abs/2507.06261

  18. [18]

    Context rot: How increasing input tokens impacts llm performance

    Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts llm performance. Technical report, Chroma, July 2025. URL https://research.trychroma.com/context-rot

  19. [19]

    Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes, 2023

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes, 2023. URL https://arxiv.org/abs/2305.02301

  20. [20]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. Minicpm: Unveiling the potential of small langu...

  21. [21]

    Editing Models with Task Arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic, 2023. URL https://arxiv.org/abs/2212.04089

  22. [22]

    Tablevqa-bench: A visual question answering benchmark on multiple table domains, 2024

    Yoonsik Kim, Moonbin Yim, and Ka Yeon Song. Tablevqa-bench: A visual question answering benchmark on multiple table domains, 2024. URLhttps://arxiv.org/abs/2404.19205

  23. [23]

    Document understanding dataset and evaluation (dude), 2023

    Jordy Van Landeghem, Rubén Tito, Łukasz Borchmann, Michał Pietruszka, Paweł Józiak, Rafał Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Ackaert, Ernest Valveny, Matthew Blaschko, Sien Moens, and Tomasz Stanisławek. Document understanding dataset and evaluation (dude), 2023. URL https://arxiv.org/abs/2305.08455

  24. [24]

    Luth: Efficient french specialization for small language models and cross-lingual transfer

    Maxence Lasbordes and Sinoué Gad. Luth: Efficient french specialization for small language models and cross-lingual transfer. https://arxiv.org/abs/2510.05846, 2025. arXiv:2510.05846

  25. [25]

    Wildlong: Synthesizing realistic long-context instruction data at scale, 2025

    Jiaxi Li, Xingxing Zhang, Xun Wang, Xiaolong Huang, Li Dong, Liang Wang, Si-Qing Chen, Wei Lu, and Furu Wei. Wildlong: Synthesizing realistic long-context instruction data at scale, 2025. URL https://arxiv.org/abs/2502.16684

  27. [27]

    Chain of thought empowers transformers to solve inherently serial problems, 2024

    Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems, 2024. URLhttps://arxiv.org/abs/2402.12875

  28. [28]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023. URLhttps://arxiv.org/abs/2310.01889

  29. [29]

    Deep Thinking by Markov Chain of Continuous Thoughts

    Jiayu Liu, Zhenya Huang, Anya Sims, Enhong Chen, Yee Whye Teh, and Ning Miao. MARCOS: Deep thinking by markov chain of continuous thoughts, 2025. URL https://arxiv.org/abs/2509.25020

  30. [30]

    Lost in the middle: How language models use long contexts, 2023

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023. URL https://arxiv.org/abs/2307.03172

  31. [31]

    Bolt: Boost large vision-language model without training for long-form video understanding, 2025

    Shuming Liu, Chen Zhao, Tianqi Xu, and Bernard Ghanem. Bolt: Boost large vision-language model without training for long-form video understanding, 2025. URL https://arxiv.org/abs/2503.21483

  32. [32]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101

  33. [33]

    Mmlongbench-doc: Benchmarking long-context document understanding with visualizations, 2024

    Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, and Aixin Sun. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations, 2024. URLhttps://arxiv.org/abs/2407.01523

  34. [34]

    The expressive power of transformers with chain of thought, 2024

    William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought, 2024. URLhttps://arxiv.org/abs/2310.07923

  35. [35]

    Mistral small 3.1, 2025

    MistralAI. Mistral small 3.1, 2025

  36. [36]

    pdfa-eng-wds, 2024

    Pablo Montalvo and Ross Wightman. pdfa-eng-wds, 2024. URL https://huggingface.co/datasets/pixparse/pdfa-eng-wds. Accessed 2026-01-23

  37. [37]

    GPT-4o System Card

    OpenAI. Gpt-4o system card, 2024. URLhttps://arxiv.org/abs/2410.21276

  38. [38]

    OpenAI o1 System Card

    OpenAI. Openai o1 system card, 2024. URLhttps://arxiv.org/abs/2412.16720

  39. [39]

    Introducing gpt-5.3-codex, February 2026

    OpenAI. Introducing gpt-5.3-codex, February 2026. URL https://openai.com/index/introducing-gpt-5-3-codex/. Accessed: 2026-02-28

  40. [40]

    Long-vita: Scaling large multi-modal models to 1 million tokens with leading short-context accuracy, 2025

    Yunhang Shen, Chaoyou Fu, Shaoqi Dong, Xiong Wang, Yi-Fan Zhang, Peixian Chen, Mengdan Zhang, Haoyu Cao, Ke Li, Shaohui Lin, Xiawu Zheng, Yan Zhang, Yiyi Zhou, Ran He, Caifeng Shan, Rongrong Ji, and Xing Sun. Long-vita: Scaling large multi-modal models to 1 million tokens with leading short-context accuracy, 2025. URL https://arxiv.org/abs/2502.05177

  41. [41]

    Solopo: Unlocking long-context capabilities in llms via short-to-long preference optimization, 2025

    Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, and Fei Huang. Solopo: Unlocking long-context capabilities in llms via short-to-long preference optimization, 2025. URL https://arxiv.org/abs/2505.11166

  42. [42]

    Slidevqa: A dataset for document visual question answering on multiple images, 2023

    Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. Slidevqa: A dataset for document visual question answering on multiple images, 2023. URLhttps://arxiv.org/abs/2301.04883

  43. [43]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. URLhttps://arxiv.org/abs/2505.09388

  44. [44]

    Qwen3-VL Technical Report

    Qwen Team. Qwen3-vl technical report, 2025. URL https://arxiv.org/abs/2511.21631

  45. [45]

    How to train your long-context visual document model. arXiv preprint arXiv:2602.15257, 2026

    Austin Veselka. How to train your long-context visual document model, 2026. URL https://arxiv.org/abs/2602.15257

  46. [46]

    Investigating mysteries of cot-augmented distillation, 2024

    Somin Wadhwa, Silvio Amir, and Byron C. Wallace. Investigating mysteries of cot-augmented distillation, 2024. URL https://arxiv.org/abs/2406.14511

  47. [47]

    Qwenlong-l1: Towards long-context large reasoning models with reinforcement learning, 2025

    Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, and Ming Yan. Qwenlong-l1: Towards long-context large reasoning models with reinforcement learning, 2025. URLhttps://arxiv.org/abs/2505.17667

  48. [48]

    Bootstrap your own context length, 2025

    Liang Wang, Nan Yang, Xingxing Zhang, Xiaolong Huang, and Furu Wei. Bootstrap your own context length, 2025. URLhttps://arxiv.org/abs/2412.18860

  49. [49]

    Mmlongbench: Benchmarking long-context vision-language models effectively and thoroughly. arXiv preprint arXiv:2505.10610, 2025

    Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, and Mark Steedman. Mmlongbench: Benchmarking long-context vision-language models effectively and thoroughly, 2025. URL https://arxiv.org/abs/2505.10610

  50. [50]

    Self-preference bias in LLM-as-a-judge

    Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in LLM-as-a-judge. URL https://openreview.net/forum?id=Ns8zGZ0lmM

  52. [52]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

  53. [53]

    Thinking like transformers, 2021

    Gail Weiss, Yoav Goldberg, and Eran Yahav. Thinking like transformers, 2021. URL https://arxiv.org/abs/2106.06981

  54. [54]

    Dual-head reasoning distillation: Improving classifier accuracy with train-time-only reasoning, 2025

    Jillian Xu, Dylan Zhou, Vinay Shukla, Yang Yang, Junrui Ruan, Shuhuai Lin, Wenfei Zou, Yinxiao Liu, and Karthik Lakshmanan. Dual-head reasoning distillation: Improving classifier accuracy with train-time-only reasoning, 2025. URL https://arxiv.org/abs/2509.21487

  55. [55]

    Longfaith: Enhancing long-context reasoning in llms with faithful synthetic data, 2025

    Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, and Jian Guo. Longfaith: Enhancing long-context reasoning in llms with faithful synthetic data, 2025. URL https://arxiv.org/abs/2502.12583

  57. [57]

    Helmet: How to evaluate long-context language models effectively and thoroughly, 2025

    Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, and Danqi Chen. Helmet: How to evaluate long-context language models effectively and thoroughly, 2025. URLhttps://arxiv.org/abs/2410.02694

  58. [58]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Z.ai. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2026. URLhttps://arxiv.org/abs/2507.01006

  59. [59]

    Quiet-STaR: Language models can teach themselves to think before speaking, 2024

    Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D. Goodman. Quiet-STaR: Language models can teach themselves to think before speaking, 2024. URL https://arxiv.org/abs/2403.09629

  60. [60]

    Improve vision language model chain-of-thought reasoning, 2024

    Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning, 2024. URLhttps://arxiv.org/abs/2410.16198

  61. [61]

    Chain-of-thought tokens are computer program variables, 2025

    Fangwei Zhu, Peiyi Wang, and Zhifang Sui. Chain-of-thought tokens are computer program variables, 2025. URLhttps://arxiv.org/abs/2505.04955

  62. [62]

    ring-flash-attn, 2024

    Zilin Zhu. ring-flash-attn, 2024. URL https://github.com/zhuzilin/ring-flash-attention
