P-MTP: Efficient Document Parsing via Multi-Token Prediction with Progressive Depth Scaling

Chenxi Zhai; Jingjing Wu; Kunbin Chen; Le Xiang; Qunyi Xie; Shu Wei; Wei He; Xiao Tan

arxiv: 2606.24447 · v1 · pith:XQQCCYU7new · submitted 2026-06-23 · 💻 cs.CV

Le Xiang, Chenxi Zhai, Shu Wei, Jingjing Wu, Qunyi Xie, Xiao Tan, Kunbin Chen, Wei He

Pith reviewed 2026-06-26 00:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords: document parsing · multi-token prediction · vision-language models · inference acceleration · progressive curriculum loss · speculative decoding · dynamic drafting

The pith

P-MTP scales look-ahead depth in multi-token prediction to deliver up to 5x faster document parsing with negligible accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models for document parsing suffer from high latency on token-dense pages. The paper shows that multi-token prediction can be stabilized at greater look-ahead depths by introducing a progressive curriculum loss that re-weights predictions according to cumulative path reliability and retrospective target consistency. This loss creates an automatic easy-to-hard training schedule that suppresses gradient noise in long-range forecasts. A lightweight MTP module plus confidence-gated dynamic drafting at inference time then maximizes the number of accepted tokens. The result is the first reported validation that extensive look-ahead multi-token prediction works in the document domain, producing the stated speedups across benchmarks and model architectures.

Core claim

P-MTP introduces Progressive Multi-Token Prediction together with a lightweight MTP module that scales look-ahead depth. The Progressive Curriculum Loss adaptively re-weights look-ahead depths using cumulative path reliability and retrospective target consistency, suppressing gradient noise so the model can master increasingly distant predictions through an automated easy-to-hard transition. Confidence-Gated Dynamic Drafting then calibrates speculative length at inference to raise acceptance rate and reduce wasted computation. Across multiple benchmarks and architectures this combination yields up to 5× speedup with negligible loss in accuracy.
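As a rough illustration of confidence-gated drafting, the sketch below stops extending a draft once the running product of per-token draft confidences falls below a threshold. The function name `gated_draft`, the cumulative-product gate, and the `threshold` and `max_depth` parameters are all assumptions made for illustration; the abstract does not specify the paper's actual gating rule.

```python
def gated_draft(draft_steps, threshold=0.5, max_depth=8):
    """Return the drafted tokens worth sending to verification.

    draft_steps: iterable of (token, confidence) pairs in look-ahead order,
    where confidence is the draft head's probability for its own token.
    The gate (hypothetical) is the product of confidences so far.
    """
    kept = []
    cumulative = 1.0
    for token, confidence in draft_steps:
        cumulative *= confidence
        # Stop drafting when the path looks unreliable or the depth cap is hit.
        if cumulative < threshold or len(kept) >= max_depth:
            break
        kept.append(token)
    return kept


if __name__ == "__main__":
    steps = [("a", 0.9), ("b", 0.9), ("c", 0.5), ("d", 0.99)]
    print(gated_draft(steps, threshold=0.5))
```

In this toy run the third token pushes the cumulative confidence to 0.405, below the 0.5 threshold, so only the first two tokens are drafted. Shorter drafts on uncertain spans mean fewer rejected tokens at verification time, which is the effect the paper attributes to dynamic drafting.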

What carries the argument

Progressive Curriculum Loss that re-weights look-ahead depths by cumulative path reliability and retrospective target consistency to stabilize deeper predictions.
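The idea of weighting deeper look-ahead positions by how reliable the path leading to them is can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's formulation: the weight at depth k is taken to be the product of the probabilities the model assigned to the correct targets at all shallower depths, the retrospective target consistency term is not modeled at all, and the function names are hypothetical. In a real training loop the weights would presumably be treated as constants (no gradient flowing through them), which is also an assumption here.

```python
import math


def path_reliability_weights(target_probs):
    """Weight for each depth = product of target probabilities at shallower depths.

    target_probs[k] is the probability assigned to the ground-truth token
    at look-ahead depth k + 1. The first depth always gets weight 1.0.
    """
    weights = []
    running = 1.0
    for p in target_probs:
        weights.append(running)
        running *= p
    return weights


def depth_weighted_loss(target_probs, eps=1e-12):
    """Average of per-depth cross-entropy terms, scaled by path reliability."""
    weights = path_reliability_weights(target_probs)
    losses = [-math.log(max(p, eps)) for p in target_probs]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


if __name__ == "__main__":
    print(path_reliability_weights([0.5, 0.5, 0.5]))
    print(path_reliability_weights([0.9, 0.9, 0.9]))
```

With target probabilities of 0.5 at every depth the weights are 1, 0.5, and 0.25, so distant predictions contribute little. As shallow predictions improve, the deeper weights rise on their own, which is one way an easy-to-hard schedule could emerge without a hand-tuned curriculum.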

If this is right

Document parsing throughput increases by up to a factor of five on existing hardware without retraining the base vision-language model.
The same lightweight MTP module and loss can be attached to multiple existing architectures while preserving end-to-end structured-text output quality.
Higher acceptance rates during inference reduce the number of rejected draft tokens and therefore the average compute per page.
The method supplies the first empirical evidence that extensive look-ahead multi-token prediction is viable inside the document parsing domain.
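The link between acceptance rate, draft length, and speedup in the points above can be made concrete with the standard speculative-decoding estimate. The sketch assumes each drafted token is accepted independently with the same probability `alpha`, which is a simplification, and `draft_cost_ratio` is a hypothetical parameter for the cost of one draft step relative to one base-model pass. None of the numbers are taken from the paper.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verification pass with draft length k.

    Assumes i.i.d. acceptance with probability alpha; the pass always
    yields at least one token from the base model itself.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, k, draft_cost_ratio=0.0):
    """Speedup over one-token-per-pass decoding under the same assumptions."""
    return expected_tokens_per_step(alpha, k) / (1.0 + k * draft_cost_ratio)


if __name__ == "__main__":
    print(round(expected_tokens_per_step(0.8, 4), 4))
    print(round(estimated_speedup(0.8, 4, draft_cost_ratio=0.05), 4))
```

Under these assumptions an acceptance rate of 0.8 with four drafted tokens gives about 3.36 tokens per pass, and an acceptance rate of 0.8 can never exceed 5 tokens per pass at any draft length. This suggests why a speedup near 5× would need both deep look-ahead and a high acceptance rate.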

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The progressive loss schedule may generalize to other dense-prediction VLM tasks such as chart understanding or long-form captioning if the reliability metrics transfer.
Further increases in maximum look-ahead depth could become feasible by adding a second-stage consistency check that the current drafting gate does not yet include.
Energy cost per parsed page would drop in proportion to the observed speedup, which matters for large-scale document archives processed in the cloud.

Load-bearing premise

The adaptive re-weighting in the curriculum loss is sufficient to suppress gradient noise and permit stable training once look-ahead depth increases beyond shallow values.

What would settle it

Train an otherwise identical model without the progressive curriculum loss and measure whether accuracy collapses or training diverges at look-ahead depths of four or greater.

read the original abstract

Vision-Language Models (VLMs) have revolutionized document parsing by enabling end-to-end mapping from images to structured text, but they impose a significant latency bottleneck, particularly for token-dense documents. While Multi-Token Prediction (MTP) has emerged as a promising approach for accelerating inference, its potential is constrained by optimization instability when scaling to deeper look-ahead depths. In this paper, we propose P-MTP, a framework that leverages Progressive Multi-Token Prediction with a lightweight MTP module to scale the look-ahead depth for high-throughput document parsing. Specifically, we introduce a Progressive Curriculum Loss that adaptively re-weights different look-ahead depths using cumulative path reliability and retrospective target consistency. By effectively suppressing gradient noise in long-range predictions, P-MTP facilitates an automated easy-to-hard optimization transition, enabling the model to master increasingly distant look-ahead depths. Furthermore, we propose Confidence-Gated Dynamic Drafting to maximize the effective look-ahead depth and acceptance rate by adaptively calibrating speculative length during inference, thereby minimizing computational waste and further pushing the boundaries of inference speedup. Experimental results across multiple benchmarks and architectures demonstrate that P-MTP achieves up to a 5× speedup with negligible loss in accuracy, providing the first successful validation of extensive look-ahead MTP in the document parsing domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

P-MTP applies progressive curriculum loss and dynamic drafting to stabilize deeper MTP for document parsing and claims a 5x speedup, but the abstract supplies no data to check the results.

The main point is that this paper takes multi-token prediction, scales the look-ahead depth for vision-language document parsing, and reports up to 5x inference speedup with little accuracy cost. It frames the work as the first solid use of extensive look-ahead MTP in this setting.

The actual contribution is the progressive curriculum loss, which re-weights look-ahead depths by cumulative path reliability and retrospective target consistency to cut gradient noise and support an easy-to-hard training path. The confidence-gated dynamic drafting then adjusts speculative length at inference time to raise acceptance rates. These are straightforward engineering moves on top of existing speculative decoding techniques, and they directly target the instability that usually stops deeper MTP from working.
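For readers less familiar with the speculative decoding baseline these components build on, the sketch below shows greedy draft verification: drafted tokens are kept only up to the first position where they disagree with what the base model would emit. The function name and the list-based interface are hypothetical, and the abstract does not state which verification rule P-MTP uses, so this is the generic technique rather than the paper's procedure.

```python
def accept_longest_prefix(draft_tokens, target_tokens):
    """Greedy verification of a draft against the base model.

    draft_tokens: k tokens proposed by the draft module.
    target_tokens: k + 1 tokens, where target_tokens[i] is what the base
    model emits at position i given the prefix plus draft_tokens[:i].
    Returns the tokens actually emitted in this verification pass.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted.append(drafted)
    # The base model's own token at the first mismatch (or after a fully
    # accepted draft) is always emitted, so each pass yields >= 1 token.
    accepted.append(target_tokens[len(accepted)])
    return accepted


if __name__ == "__main__":
    print(accept_longest_prefix(["a", "b", "c"], ["a", "b", "x", "y"]))
```

Here two drafted tokens are accepted and the third is replaced by the base model's token, so the pass emits three tokens. Under this rule the output matches plain greedy decoding of the base model, and the only thing a better draft changes is how many tokens each pass yields.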

The experiments are described only at a high level: results across benchmarks and architectures show the speedup. If the full paper contains proper ablations, baseline comparisons, and error bars, the practical value for high-volume document workflows would be real.

The soft spots are the missing details. The abstract gives no numbers, no training curves, and no derivation of the loss terms, so it is impossible to tell whether the new components are doing the work or whether the gains come from tuning. The reliability measures could be circular if they lean too heavily on the model's own predictions, but that cannot be checked without the equations and code.

This paper is for engineers and researchers who need faster token-dense VLM inference on documents. A reader already following speculative decoding work would see the domain-specific adaptations clearly.

It deserves a serious referee because the central claim is concrete and the method description is internally consistent enough to evaluate. I would send it to peer review.

Referee Report

3 major / 0 minor

Summary. The paper proposes P-MTP, a framework to accelerate inference in Vision-Language Models for document parsing via scaled Multi-Token Prediction (MTP). It introduces Progressive Curriculum Loss, which adaptively re-weights look-ahead depths using cumulative path reliability and retrospective target consistency to stabilize training, and Confidence-Gated Dynamic Drafting, which adaptively calibrates speculative length at inference time. The central claim is that these components enable up to 5× speedup with negligible accuracy loss across benchmarks and architectures, representing the first successful validation of extensive look-ahead MTP in the document parsing domain.

Significance. If the experimental claims hold with proper validation, the work would address a practical latency bottleneck in token-dense document parsing and demonstrate that deep MTP can be stabilized in this domain. The adaptive re-weighting and dynamic drafting ideas could generalize to other speculative decoding settings, but their impact depends on whether the mechanisms are shown to be robust rather than task-specific.

major comments (3)

[Abstract] The central performance claim of 'up to a 5× speedup with negligible loss in accuracy' is stated without any quantitative results, tables, error bars, ablation studies, or specific benchmark numbers. This prevents assessment of whether the claim is supported by the experiments.
[Abstract] The Progressive Curriculum Loss is described only qualitatively ('adaptively re-weights ... using cumulative path reliability and retrospective target consistency'); no equations, loss formulation, or derivation details are supplied, so it is impossible to verify whether it suppresses gradient noise or enables the claimed easy-to-hard transition.
[Abstract] No information is provided on the architectures tested, the document parsing benchmarks used, the baselines compared against, or the look-ahead depths achieved, making it impossible to evaluate the scope or reproducibility of the 'first successful validation' claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that enhancing the abstract with more specific details will improve clarity and allow better assessment of the claims. We will revise the abstract accordingly while ensuring it remains concise. Below we address each comment point by point.

Referee: [Abstract] The central performance claim of 'up to a 5× speedup with negligible loss in accuracy' is stated without any quantitative results, tables, error bars, ablation studies, or specific benchmark numbers. This prevents assessment of whether the claim is supported by the experiments.

Authors: We agree that the abstract would benefit from concrete numbers to support the claim. In the revised version, we will incorporate specific quantitative results (e.g., peak speedups on particular benchmarks with associated accuracy deltas), reference the main results table, and note the presence of ablations and error bars in the experimental section. This directly addresses the need for evidence within the abstract itself.

Revision: yes

Referee: [Abstract] The Progressive Curriculum Loss is described only qualitatively ('adaptively re-weights ... using cumulative path reliability and retrospective target consistency'); no equations, loss formulation, or derivation details are supplied, so it is impossible to verify whether it suppresses gradient noise or enables the claimed easy-to-hard transition.

Authors: The abstract provides a high-level overview due to space constraints. The full loss formulation, equations, and derivation demonstrating gradient noise suppression and the easy-to-hard transition are provided in Section 3.2. We will revise the abstract to include a brief reference to the key components of the loss (e.g., the adaptive re-weighting terms) to improve verifiability without exceeding length limits.

Revision: partial

Referee: [Abstract] No information is provided on the architectures tested, the document parsing benchmarks used, the baselines compared against, or the look-ahead depths achieved, making it impossible to evaluate the scope or reproducibility of the 'first successful validation' claim.

Authors: We will update the abstract to explicitly list the tested architectures, benchmarks (e.g., standard document parsing datasets), baselines (including autoregressive decoding and prior MTP variants), and achieved look-ahead depths. These details are already reported in Sections 4 and 5; adding them to the abstract will strengthen the reproducibility and scope assessment of the validation claim.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The abstract and available description outline Progressive Curriculum Loss (adaptive re-weighting via cumulative path reliability and retrospective target consistency) and Confidence-Gated Dynamic Drafting without any equations, self-citations, or derivations that reduce by construction to fitted inputs or prior self-referential results. No load-bearing uniqueness theorems, ansatzes smuggled via citation, or renaming of known results are present. The speedup claim is positioned as an empirical outcome across benchmarks rather than a mathematical identity derived from the method itself. The derivation chain is therefore self-contained against external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Review is abstract-only; no equations, training details, or full methods are available to enumerate free parameters, axioms, or invented entities beyond the high-level components named in the abstract.

invented entities (2)

Progressive Curriculum Loss (no independent evidence)
purpose: Adaptively re-weight look-ahead depths to suppress gradient noise
Introduced to enable stable scaling of prediction depth

Confidence-Gated Dynamic Drafting (no independent evidence)
purpose: Adaptively calibrate speculative length during inference
Proposed to maximize effective look-ahead depth and acceptance rate

pith-pipeline@v0.9.1-grok · 5785 in / 1225 out tokens · 22864 ms · 2026-06-26T00:15:46.299328+00:00 · methodology
