pith. machine review for the scientific record.

arxiv: 2605.00392 · v2 · submitted 2026-05-01 · 💻 cs.CV · cs.LG

Recognition: no theorem link

RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:07 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords token pruning · DeepSeek-OCR · visual token compression · optimal transport merging · OCR inference · two-stage pruning · attention trajectory · dynamic pruning ratio

The pith

RTPrune applies a two-stage pruning process to DeepSeek-OCR that first keeps high-norm visual tokens and then merges the rest with optimal transport to cut inference time while preserving OCR accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that DeepSeek-OCR follows a repeatable two-stage decoding pattern: it first focuses on high-norm tokens carrying the main textual and structural signals, then spreads attention across the remaining tokens. RTPrune mirrors this pattern by pruning in two steps and adding a dynamic ratio that changes with token similarity and text density. Retaining 84.25 percent of the tokens, it reaches 99.47 percent accuracy with 1.23 times faster prefill on OmniDocBench. The approach matters because it lets large vision-language OCR models run faster on ordinary hardware without the accuracy loss that earlier pruning methods produced on text-heavy images.

Core claim

DeepSeek-OCR exhibits a distinct two-stage reading trajectory during decoding in which the model initially prioritizes the majority of high-norm tokens and subsequently redistributes attention to the remaining tokens; exploiting this trajectory through a tailored two-stage prune that first prioritizes high-norm visual tokens and then pairs and merges the rest via optimal transport, together with a dynamic pruning ratio that adapts to token similarity and textual density, produces efficient feature aggregation that preserves textual fidelity for OCR tasks.

What carries the argument

The two-stage token pruning mechanism: stage one prioritizes high-norm visual tokens, stage two pairs and merges the remaining tokens according to optimal transport theory, all governed by an adaptive ratio based on similarity and textual density.
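
To make the mechanism concrete, here is a minimal sketch of the two-stage idea. It assumes (N, D) token embeddings, a cosine-distance cost, an averaged merge rule, and SciPy's linear-sum assignment as the discrete special case of optimal transport; none of these concrete choices come from the paper, which does not pin down its cost function, merge rule, or transport solver in the material above.

```python
# Minimal sketch of the two-stage pruning idea, NOT the authors' code.
# Assumptions: cosine-distance cost, 0.5/0.5 average merge, and a one-to-one
# assignment (the discrete special case of optimal transport) via SciPy.
import torch
from scipy.optimize import linear_sum_assignment


def rtprune_sketch(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """tokens: (N, D) visual token embeddings -> (K, D) retained-and-merged tokens."""
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))

    # Stage 1: dominant token selection via embedding ell-2 norms.
    keep_idx = tokens.norm(dim=-1).topk(k).indices
    drop_mask = torch.ones(n, dtype=torch.bool)
    drop_mask[keep_idx] = False
    kept, residual = tokens[keep_idx], tokens[drop_mask]
    if residual.shape[0] == 0:
        return kept

    # Stage 2: residual information integration. Pair each residual token
    # with a kept token by minimizing total cosine distance, then merge.
    cost = 1.0 - torch.nn.functional.cosine_similarity(
        residual[:, None, :], kept[None, :, :], dim=-1
    )  # (N-K, K) cost matrix
    rows, cols = linear_sum_assignment(cost.numpy())
    merged = kept.clone()
    merged[cols] = 0.5 * (kept[cols] + residual[rows])  # illustrative merge rule
    return merged
```

With keep_ratio near 0.84 the residual set is smaller than the kept set, so the rectangular assignment pairs every residual token with a distinct dominant token; a Sinkhorn-style soft transport plan would be the natural generalization if many-to-one merging were needed.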

If this is right

  • DeepSeek-OCR-Large reaches 99.47 percent accuracy on OmniDocBench while using only 84.25 percent of the original visual tokens.
  • Prefill stage runs 1.23 times faster than the unpruned baseline under the same hardware conditions (a back-of-envelope reading of this number follows the list).
  • The method outperforms prior token-pruning techniques that were designed for general vision-language models on text-dense OCR workloads.
  • A single dynamic ratio schedule suffices to balance speed and fidelity without manual tuning per document.
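
A back-of-envelope check of how an 84.25 percent retention rate could plausibly translate into the reported prefill speedup, assuming prefill cost splits into a linear per-token term and a quadratic self-attention term. The 30 percent attention share below is an invented illustration, not a figure from the paper.

```python
# Hypothetical cost model: prefill = linear O(n) work + quadratic O(n^2)
# self-attention. The 0.30 attention share is an assumption for illustration.
retention = 0.8425                 # fraction of visual tokens kept

linear_factor = retention          # O(n) costs shrink linearly (~0.843)
quadratic_factor = retention ** 2  # O(n^2) attention shrinks faster (~0.710)

attn_share = 0.30                  # invented split of prefill FLOPs
total_factor = (1 - attn_share) * linear_factor + attn_share * quadratic_factor
print(f"total cost factor ~{total_factor:.3f} -> speedup ~{1 / total_factor:.2f}x")
# ~0.803 -> ~1.25x, in the same ballpark as the reported 1.23x prefill speedup
```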

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same high-norm-plus-optimal-transport pattern might accelerate other vision-language models whose decoding shows an early focus on salient tokens followed by attention redistribution.
  • Applying the dynamic ratio to non-OCR image tasks could reveal whether textual density is the main driver or whether structural density in diagrams matters equally.
  • Combining RTPrune with existing visual-text compression layers inside DeepSeek-OCR might produce further multiplicative speed-ups on very long documents.

Load-bearing premise

The two-stage reading trajectory observed in DeepSeek-OCR stays stable enough across inputs that prioritizing high-norm tokens first and then merging the rest with optimal transport will not lose the textual details needed for accurate OCR.

What would settle it

Measuring OCR accuracy on a new document set where the model's internal attention does not follow the high-norm-first-then-redistribute pattern, and finding that accuracy falls below 99 percent even at the reported 84.25 percent retention rate.
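
The paper's Top-K Intersection Ratio (Figure 3 below) already gives a concrete statistic for detecting such inputs: the overlap between the top-K visual tokens by embedding norm and the top-K tokens by received LLM attention during prefill. A minimal sketch, assuming a choice of K and a per-layer attention aggregation, neither of which is specified in the material above:

```python
# Sketch of a Top-K Intersection Ratio (TIR) check; K and the way attention
# mass is aggregated per layer are assumptions, not the paper's exact recipe.
import torch


def top_k_intersection_ratio(embeddings: torch.Tensor,
                             attn_to_visual: torch.Tensor,
                             k: int) -> float:
    """embeddings: (N, D) visual tokens; attn_to_visual: (N,) attention mass
    each visual token receives in a given LLM layer. Returns |A ∩ B| / k."""
    by_norm = set(embeddings.norm(dim=-1).topk(k).indices.tolist())
    by_attn = set(attn_to_visual.topk(k).indices.tolist())
    return len(by_norm & by_attn) / k
```

Consistently low TIR across layers on a new document set would flag exactly the inputs the settling test calls for.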

Figures

Figures reproduced from arXiv: 2605.00392 by Ben Wan, Jia Wang, Tongxuan Liu, Weizhe Huang, Yan Feng, Yuting Zeng, Zihan Tang.

Figure 1
Figure 1. Performance and efficiency. (left) Our RTPrune consistently outperforms prior token pruning methods on DeepSeek-OCR, retaining over 97.88% of accuracy with 84% of visual tokens on olmOCR-Bench. (right) Our RTPrune reduces GFLOPs by nearly 15.29% and prefill time by nearly 18.90% on OmniDocBench when maintaining 99.47% accuracy. view at source ↗
Figure 2
Figure 2. Comparison of different token pruning methods on DeepSeek-OCR-Base. The patches highlighted in blue are pruned and the text highlighted in red indicates discrepancies with the ground truth. While methods based on original image, attention scores, textual relevance, or inter-token similarity fail to generate text accurately, our approach precisely captures tokens containing critical textual information, lea… view at source ↗
Figure 3
Figure 3. Top-K Intersection Ratio (TIR) between high-norm visual embeddings and LLM high-attention visual tokens during prefill, where the violin plot shows the TIR with high-attention tokens at each individual LLM layer, while the line plot reports the TIR with the union of high-attention tokens from the current and all previous layers, evaluated on OmniDocBench (10% subset) by DeepSeek-OCR-Base. view at source ↗
Figure 4
Figure 4. Overview of RTPrune. Our framework dynamically determines the pruning ratio by evaluating inter-token similarity and image-level textual density. The process then bifurcates into dominant token selection via embedding ℓ2-norms and residual information integration through optimal transport-based merging. As a training-free and model-agnostic approach, RTPrune mimics the dual-pass reading behavior of the LLM… view at source ↗
Figure 5
Figure 5. Performance comparison of various pruning methods with dynamic pruning strategy on the Ocean-OCR benchmark using DeepSeek-OCR-Gundam across multiple noteworthy OCR abilities. E, F, P, R, B, and M are the abbreviations for Edit Distance, F1-Score, Precision, Recall, BLEU, and METEOR respectively. For Edit Distance, the plotted score is computed as x_after = 1 − x_before for better visualization. view at source ↗
Figure 6
Figure 6. Visualizations of visual tokens redundancy on DeepSeek-OCR-Base. The patch highlighted in blue is pruned and the patches highlighted in red are kept. view at source ↗
Figure 7
Figure 7. Visualizations of textual relevance in different CLIP models. The color gradient indicates the degree of relevance, where cooler blue tones represent lower correlation and warmer red tones denote higher relevance. view at source ↗
read the original abstract

DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet visual tokens remain prone to redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, and subsequently redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23× faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes RTPrune, a two-stage token pruning method for DeepSeek-OCR. It identifies a distinct two-stage reading trajectory in which the model first prioritizes high-norm visual tokens and later redistributes attention to the remainder. Stage one retains high-norm tokens that capture salient textual and structural information; stage two pairs and merges the remaining tokens via optimal transport for feature aggregation. A dynamic pruning ratio adapts to token similarity and textual density. The abstract reports state-of-the-art results of 99.47% accuracy and 1.23× faster prefill on OmniDocBench at 84.25% token retention when applied to DeepSeek-OCR-Large.

Significance. If the two-stage trajectory observation holds and the pruning/merging steps preserve textual fidelity, the approach could meaningfully accelerate OCR inference in VLMs while maintaining high accuracy. The use of optimal transport for merging and a task-specific dynamic ratio are potentially useful ideas. However, the abstract supplies no baselines, ablations, error bars, or implementation details, so the practical significance and reproducibility cannot be evaluated from the provided text.

major comments (1)
  1. [Abstract] The abstract states strong numerical results (99.47% accuracy, 1.23× faster prefill) but supplies no baselines, error bars, ablation studies, or implementation details; without these it is impossible to verify whether the claimed accuracy and speedup are supported by the data.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on the abstract below.

read point-by-point responses
  1. Referee: [Abstract] The abstract states strong numerical results (99.47% accuracy, 1.23× faster prefill) but supplies no baselines, error bars, ablation studies, or implementation details; without these it is impossible to verify whether the claimed accuracy and speedup are supported by the data.

    Authors: We agree that the abstract, due to strict length limits, does not include explicit baselines, error bars, ablations, or implementation details. These elements are fully detailed in the Experiments section of the manuscript, including comparisons to prior token-pruning methods on OmniDocBench, ablation studies on the two-stage trajectory and dynamic ratio, results with standard deviations across runs, and implementation specifics in the appendix. To improve verifiability from the abstract alone, we will revise it to briefly note that the reported accuracy and speedup are achieved relative to existing pruning baselines while retaining 84.25% of tokens. This change will be made in the next version.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected from the abstract

full rationale

The abstract presents a high-level motivation from an observed two-stage reading trajectory in DeepSeek-OCR, followed by a proposed two-stage pruning method (high-norm prioritization then optimal-transport merging) and a dynamic ratio adapting to token similarity and density. No equations, parameter-fitting procedures, self-citations, or derivations are provided that could reduce the claimed method or performance results to inputs by construction. The reported accuracy and speedup are experimental outcomes, not predictions forced by the pruning rule itself. With only the abstract available, no load-bearing circular steps of any enumerated kind are identifiable; the derivation chain remains self-contained at the level of description.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method rests on the domain assumption that the model's internal decoding trajectory can be directly translated into a pruning policy and that optimal transport merging preserves OCR-relevant features.

free parameters (1)
  • dynamic pruning ratio
    The ratio is stated to adapt to token similarity and textual density; its exact functional form or thresholds are not specified and may require tuning (a hypothetical shape is sketched after this ledger).
axioms (1)
  • domain assumption: DeepSeek-OCR exhibits a distinct two-stage reading trajectory that first prioritizes high-norm tokens.
    This observation from decoding analysis is the direct motivation for the first pruning stage.
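
Since the ledger flags the ratio's functional form as unspecified, here is one hypothetical shape consistent with the description: keep fewer tokens when inter-token similarity is high (more redundancy) and more when the page is text-dense. The Sobel-gradient proxy for textual density is a guess motivated by the paper citing Sobel (1968) in its references; the weights, bounds, and formula are invented.

```python
# Hypothetical dynamic keep-ratio: every constant and the exact formula are
# invented; only the two inputs (token similarity, textual density) come from
# the paper's description.
import torch
import torch.nn.functional as F


def dynamic_keep_ratio(tokens: torch.Tensor, gray_page: torch.Tensor,
                       lo: float = 0.5, hi: float = 0.95) -> float:
    """tokens: (N, D) visual tokens; gray_page: (1, 1, H, W) grayscale in [0, 1]."""
    # Redundancy: mean pairwise cosine similarity (diagonal included, for brevity).
    z = F.normalize(tokens, dim=-1)
    similarity = (z @ z.T).mean().item()

    # Textual-density proxy: mean Sobel gradient magnitude of the page image.
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    gx = F.conv2d(gray_page, kx.view(1, 1, 3, 3), padding=1)
    gy = F.conv2d(gray_page, kx.t().reshape(1, 1, 3, 3), padding=1)
    density = min((gx.pow(2) + gy.pow(2)).sqrt().mean().item() * 4.0, 1.0)

    # High similarity pushes the ratio down; high textual density pushes it up.
    ratio = hi - (hi - lo) * similarity * (1.0 - density)
    return max(lo, min(hi, ratio))
```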

pith-pipeline@v0.9.0 · 5515 in / 1382 out tokens · 90032 ms · 2026-05-12T03:07:12.746382+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 3 internal anchors

  1. [1]

     Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.

  2. [2]

     Ocean-OCR: Towards general OCR application via a vision-language model

     Chen, S., Guo, X., Li, Y., Zhang, T., Lin, M., Kuang, D., Zhang, Y., Ming, L., Zhang, F., Wang, Y., et al. Ocean-OCR: Towards general OCR application via a vision-language model. arXiv preprint arXiv:2501.15558.

  3. [3]

    PaddleOCR 3.0 Technical Report

     Cui, C., Sun, T., Lin, M., Gao, T., Zhang, Y., Liu, J., Wang, X., Zhang, Z., Zhou, C., Liu, H., et al. PaddleOCR 3.0 technical report. arXiv preprint arXiv:2507.05595.

  4. [4]

    Vision transformers need registers

     Darcet, T., Oquab, M., Mairal, J., and Bojanowski, P. Vision transformers need registers. In International Conference on Learning Representations, volume 2024, pp. 2632–2652.

  5. [5]

     Scope: Saliency-coverage oriented token pruning for efficient multimodel LLMs

     Deng, J., Li, W., Zhou, J. T., and He, Y. Scope: Saliency-coverage oriented token pruning for efficient multimodel LLMs. arXiv preprint arXiv:2510.24214, 2025.

  6. [6]

     Glm-ocr technical report

     Duan, S., Xue, Y., Wang, W., Su, Z., Liu, H., Yang, S., Gan, G., Wang, G., Wang, Z., Yan, S., et al. Glm-ocr technical report. arXiv preprint arXiv:2603.10910.

  7. [7]

     Nüwa: Mending the spatial integrity torn by VLM token pruning

     Huang, Y., Ma, F., Shao, Y., Guo, J., Yu, Z., Cui, L., and Tian, Q. Nüwa: Mending the spatial integrity torn by VLM token pruning. arXiv preprint arXiv:2602.02951, 2026.

  8. [8]

     Dcp: Dual-cue pruning for efficient large vision-language models

     Jiang, L., Zhang, Z., Zeng, Y., Xie, C., Liu, T., Li, Z., Cheng, L., and Xu, X. Dcp: Dual-cue pruning for efficient large vision-language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 21202–21215, 2025a. Jiang, Y., Wu, Q., Lin, W., Yu, W., and Zhou, Y. What kind of visual tokens do we need? traini…

  9. [9]

     Register and CLS tokens yield a decoupling of local and global features in large ViTs

     Lappe, A. and Giese, M. A. Register and CLS tokens yield a decoupling of local and global features in large ViTs. arXiv preprint arXiv:2505.05892.

  10. [10]

     Balanced token pruning: Accelerating vision language models beyond local optimization

     Li, K., Chen, X., Gao, C., Li, Y., and Chen, X. Balanced token pruning: Accelerating vision language models beyond local optimization. arXiv preprint arXiv:2505.22038, 2025a. Li, Y., Yang, G., Liu, H., Wang, B., and Zhang, C. dots.ocr: Multilingual document layout parsing in a single vision-language model. arXiv preprint arXiv:2512.02498, 2025b. Liu, H…

  11. [11]

     Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024a. Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024b.

  12. [12]

     olmOCR: Unlocking trillions of tokens in PDFs with vision language models

     Poznanski, J., Rangapur, A., Borchardt, J., Dunkelberger, J., Huff, R., Lin, D., Wilhelm, C., Lo, K., and Soldaini, L. olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv preprint arXiv:2502.18443, 2025a. Poznanski, J., Soldaini, L., and Lo, K. olmOCR 2: Unit test rewards for document OCR, 2025b. URL https://arxiv.org/abs/25…

  13. [13]

     A 3x3 isotropic gradient operator for image processing

     Sobel, I., Feldman, G., et al. A 3x3 isotropic gradient operator for image processing. A talk at the Stanford Artificial Project, 1968, pp. 271–272.

  14. [14]

     LightOnOCR: A 1B end-to-end multilingual vision-language model for state-of-the-art OCR

     Taghadouini, S., Cavaillès, A., and Aubertin, B. LightOnOCR: A 1B end-to-end multilingual vision-language model for state-of-the-art OCR. arXiv preprint arXiv:2601.14251.

  15. [15]

    DeepSeek-OCR: Contexts Optical Compression

     Wei, H., Sun, Y., and Li, Y. DeepSeek-OCR: Contexts optical compression. URL https://arxiv.org/abs/2510.18234. Wei, H., Sun, Y., and Li, Y. DeepSeek-OCR 2: Visual causal flow. arXiv preprint arXiv:2601.20552.

  16. [16]

    Stop looking for important tokens in multimodal language models: Duplication matters more

     Wen, Z., Gao, Y., Wang, S., Zhang, J., Zhang, Q., Li, W., He, C., and Zhang, L. Stop looking for important tokens in multimodal language models: Duplication matters more. arXiv preprint arXiv:2502.11494.

  17. [17]

     Towards efficient multimodal large language models: A survey on token compression

     Yao, L., Xing, L., Shi, Y., Li, S., Liu, Y., Dong, Y., Zhang, Y.-F., Li, L., Dong, Q., Dong, X., et al. Towards efficient multimodal large language models: A survey on token compression. TechRxiv, 2026(0112). doi: 10.36227/techrxiv.176823010.07236701/v1.

  18. [18]

     Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

     Ye, W., Wu, Q., Lin, W., and Zhou, Y. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 22128–22136.

  19. [19]

     LMMs-Eval: Reality check on the evaluation of large multimodal models

     Zhang, K., Li, B., Zhang, P., Pu, F., Cahyono, J. A., Hu, K., Liu, S., Zhang, Y., Yang, J., Li, C., et al. LMMs-Eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 881–916, 2025a. Zhang, Q., Liu, M., Li, L., Lu, M., Zhang, Y., Pan, J., She, Q., and Zhang, S. Beyo…
