Recognition: 2 Lean theorem links
DeepSeek-OCR: Contexts Optical Compression
Pith reviewed 2026-05-12 04:30 UTC · model grok-4.3
The pith
DeepSeek-OCR renders text contexts as 2D images and compresses them into vision tokens, recovering the original text with high accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepSeek-OCR uses DeepEncoder to compress high-resolution renderings of text into a small set of vision tokens, which a decoder then converts back to text. This optical compression maintains 97% decoding precision when the number of text tokens is within 10 times the number of vision tokens (compression < 10x) and about 60% at 20x, while outperforming competing OCR systems on OmniDocBench with significantly fewer vision tokens.
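To make the claimed operating points concrete, here is a minimal sketch of the compression-ratio bookkeeping, assuming only the two numbers reported in the abstract; the linear falloff between 10x and 20x is our interpolation, not a measured curve.

```python
# Minimal bookkeeping for the two reported operating points. The linear
# falloff between 10x and 20x is our interpolation, not a paper result.
def compression_ratio(n_text_tokens: int, n_vision_tokens: int) -> float:
    return n_text_tokens / n_vision_tokens

def reported_precision(ratio: float) -> float:
    if ratio <= 10:
        return 0.97  # reported: 97% precision below 10x compression
    if ratio >= 20:
        return 0.60  # reported: ~60% precision at 20x compression
    return 0.97 - (ratio - 10) * (0.97 - 0.60) / 10  # assumed interpolation

for text_tokens in (800, 1500, 2000):
    r = compression_ratio(text_tokens, n_vision_tokens=100)
    print(f"{text_tokens} text tokens @ 100 vision tokens -> "
          f"{r:.0f}x, ~{reported_precision(r):.0%} precision")
```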
What carries the argument
The DeepEncoder, which processes high-resolution 2D image renderings of text to produce a compressed sequence of vision tokens while keeping activations low.
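As a reading aid, here is a schematic PyTorch sketch of the kind of pipeline this description implies: patchify at high resolution, mix locally while activations are still cheap, then downsample convolutionally before any global processing. The layer choices and the 16x token-reduction factor are illustrative assumptions, not the paper's specification.

```python
# Schematic sketch (not the paper's architecture) of high-resolution
# patchification followed by convolutional token reduction.
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # image -> patch tokens
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1)    # cheap local mixing (stand-in for windowed attention)
        self.reduce = nn.Conv2d(dim, dim, kernel_size=4, stride=4)    # 16x token reduction (4x per side)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.patchify(img)               # (B, D, H/16, W/16)
        x = torch.relu(self.local(x))
        x = self.reduce(x)                   # (B, D, H/64, W/64)
        return x.flatten(2).transpose(1, 2)  # (B, N_vision_tokens, D)

tokens = TokenCompressor()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 256, 768]): 4096 patches compressed to 256 vision tokens
```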
If this is right
- Long contexts in LLMs can be stored more efficiently using optical representations (a minimal rendering sketch follows this list).
- OCR tasks on documents can be performed with reduced token counts compared to traditional methods.
- Large-scale training data for vision-language models can be generated rapidly at rates exceeding 200k pages per day on a single GPU.
- Potential applications in preserving and processing historical documents with memory-efficient methods.
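These implications all rest on the same optical front-end: rendering text into a fixed-size image so that downstream token cost scales with the page, not the text length. A minimal sketch, with font, page size, and wrapping as arbitrary choices of ours rather than the paper's rendering recipe:

```python
# Minimal optical front-end sketch: one fixed-size page image regardless of
# how many text tokens it holds. Rendering choices are ours, not the paper's.
from PIL import Image, ImageDraw
import textwrap

def render_page(text: str, size=(1024, 1024), margin=32) -> Image.Image:
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    wrapped = textwrap.fill(text, width=90)  # assumed line width
    draw.multiline_text((margin, margin), wrapped, fill="black")
    return img

page = render_page("long context " * 500)
page.save("context_page.png")  # token cost is now set by the image resolution
```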
Where Pith is reading between the lines
- Similar optical mapping might extend to compressing other structured data like code or tables beyond plain text.
- Investigating how the compression affects semantic understanding in downstream LLM tasks beyond raw OCR recovery.
- Testing the approach on multilingual or handwritten documents to broaden its utility.
Load-bearing premise
That converting text to high-resolution 2D images and passing them through the DeepEncoder retains enough information about content and structure for the decoder to reconstruct the text accurately at the stated compression levels.
What would settle it
Applying the model to a diverse collection of complex documents, such as those with tables, figures, and varying layouts, and measuring whether OCR accuracy drops significantly below 97% at compression ratios under 10x.
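A hedged sketch of that settling experiment: sweep compression ratios over complex documents and compare measured precision against the claimed curve. Here `ocr_model` and `load_documents` are hypothetical stand-ins, not APIs from the released code, and the similarity metric is our choice.

```python
# Hedged sketch of the settling experiment. `ocr_model` and `load_documents`
# are hypothetical stand-ins, not APIs from the released DeepSeek-OCR code.
from difflib import SequenceMatcher

def precision(decoded: str, reference: str) -> float:
    # Crude character-level similarity as a stand-in metric (our choice,
    # since the paper does not define 'decoding (OCR) precision').
    return SequenceMatcher(None, decoded, reference).ratio()

def sweep(ocr_model, load_documents, ratios=(5, 10, 15, 20)):
    results = {}
    for ratio in ratios:
        scores = []
        for doc in load_documents():  # complex layouts: tables, figures, multi-column
            budget = max(1, len(doc.text_tokens) // ratio)  # vision-token budget at this ratio
            decoded = ocr_model.decode(doc.image, n_vision_tokens=budget)
            scores.append(precision(decoded, doc.text))
        results[ratio] = sum(scores) / len(scores)
    # The claim is in trouble if results[r] falls well below 0.97 for r < 10.
    return results
```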
Original abstract
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeepSeek-OCR for optical compression of long text contexts: text is rendered as high-resolution 2D images, encoded by DeepEncoder into a small number of vision tokens while keeping activations low, and decoded by DeepSeek3B-MoE-A570M to recover the original content. It reports 97% decoding (OCR) precision when text tokens are within 10x the vision tokens (<10x compression) and ~60% at 20x compression, plus superior results on OmniDocBench versus GOT-OCR2.0 (using 100 vs 256 tokens/page) and MinerU2.0 (using <800 vs 6000+ tokens/page), with production throughput of 200k+ pages/day on one A100-40G. Code and weights are released publicly.
Significance. If the reported precision and benchmark results hold under rigorous evaluation, the work could meaningfully advance long-context modeling by demonstrating a viable optical 2D compression route that drastically reduces token counts while retaining semantic and layout fidelity. The high-throughput data-generation capability and public release of code/weights are concrete strengths that would support follow-on research on memory mechanisms and historical document processing.
major comments (3)
- [Abstract] Abstract: The central claims of 97% OCR precision at compression ratios <10x and 60% at 20x, together with the OmniDocBench outperformance figures, are stated without any accompanying experimental details, dataset descriptions, error bars, controls, or per-category breakdowns. This absence makes it impossible to verify whether the encoder preserves layout cues for tables, formulas, multi-column text, or non-standard fonts outside the training distribution.
- [DeepEncoder] DeepEncoder description (throughout): The architecture is characterized only at a high level as maintaining low activations under high-resolution input and achieving high compression; no equations, layer counts, token-reduction mechanism, or ablation on information loss for semantic/layout elements are supplied, leaving the core assumption that rasterized text yields reconstructible vision tokens untestable.
- [Experiments / OmniDocBench] OmniDocBench and production claims: The token-count comparisons (100 vision tokens vs. GOT-OCR2.0 at 256; <800 vs. MinerU2.0 at 6000+) and the 200k+ pages/day throughput are presented without specifying input resolution, exact benchmark protocol, model-size controls, or how 'decoding (OCR) precision' is computed, rendering the practical-value assertions unverifiable.
minor comments (2)
- The phrase 'decoding (OCR) precision' is introduced without a formal definition, formula, or reference to standard OCR metrics (e.g., CER, WER); one candidate definition is sketched after this list.
- No architecture diagram or training-recipe summary is provided despite the two-component system being central to the contribution.
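On the first minor comment: one standard candidate the referee alludes to is character error rate (CER) from Levenshtein distance. A self-contained sketch, offered as our assumption about a plausible metric rather than the paper's definition:

```python
# CER via Levenshtein distance: one standard OCR metric the paper could mean.
# This is our assumption about the metric, not the paper's definition.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    return levenshtein(hypothesis, reference) / max(1, len(reference))

print(cer("DeepSeek-0CR", "DeepSeek-OCR"))  # 1 substitution / 12 chars ~ 0.083
```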
Simulated Author's Rebuttal
We sincerely thank the referee for the detailed and constructive comments. We agree that the manuscript would benefit from additional experimental details, architectural specifications, and benchmark protocols to enhance verifiability. We will undertake a major revision to address these points by expanding the abstract, adding a technical description of DeepEncoder with equations and ablations, and providing full benchmark protocols, input resolutions, metric definitions, and per-category breakdowns in the Experiments section.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claims of 97% OCR precision at compression ratios <10x and 60% at 20x, together with the OmniDocBench outperformance figures, are stated without any accompanying experimental details, dataset descriptions, error bars, controls, or per-category breakdowns. This absence makes it impossible to verify whether the encoder preserves layout cues for tables, formulas, multi-column text, or non-standard fonts outside the training distribution.
  Authors: We agree that the abstract is concise and omits supporting details. In the revised manuscript, we will expand the abstract to briefly reference the evaluation datasets (including OmniDocBench), the definition of decoding (OCR) precision, and note that full experimental protocols, error bars from repeated evaluations, and per-category breakdowns (covering tables, formulas, multi-column layouts, and varied fonts) are provided in the Experiments section. This will improve verifiability without exceeding abstract length constraints. Revision: yes.
- Referee: [DeepEncoder] DeepEncoder description (throughout): The architecture is characterized only at a high level as maintaining low activations under high-resolution input and achieving high compression; no equations, layer counts, token-reduction mechanism, or ablation on information loss for semantic/layout elements are supplied, leaving the core assumption that rasterized text yields reconstructible vision tokens untestable.
  Authors: The current description focuses on the high-level design to emphasize the optical compression application. In the revision, we will add a dedicated subsection detailing the DeepEncoder architecture, including layer counts, the token-reduction mechanism (via efficient downsampling and attention), relevant equations for activation control and compression, and ablation studies quantifying information retention for semantic content and layout elements such as tables, formulas, and multi-column text. The publicly released code will be cross-referenced for exact implementation. Revision: yes.
- Referee: [Experiments / OmniDocBench] OmniDocBench and production claims: The token-count comparisons (100 vision tokens vs. GOT-OCR2.0 at 256; <800 vs. MinerU2.0 at 6000+) and the 200k+ pages/day throughput are presented without specifying input resolution, exact benchmark protocol, model-size controls, or how 'decoding (OCR) precision' is computed, rendering the practical-value assertions unverifiable.
  Authors: We acknowledge these omissions limit verifiability. In the revised Experiments section, we will specify input image resolutions for text rendering, the exact OmniDocBench evaluation protocol (including page processing and comparison methodology), model-size controls, and the precise computation of decoding (OCR) precision (e.g., character- or token-level exact match). We will also detail the throughput measurement (hardware configuration, batch sizes, and scaling on A100-40G) and add per-category results to demonstrate robustness on tables, formulas, and non-standard fonts. Revision: yes.
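A back-of-envelope check, ours rather than the authors', of what the throughput claim implies:

```python
# What 200k+ pages/day on a single A100-40G implies about sustained per-page
# latency (batching folded in); numbers from the abstract, arithmetic ours.
pages_per_day = 200_000
seconds_per_day = 24 * 60 * 60  # 86,400
print(f"{seconds_per_day / pages_per_day:.2f} s/page")  # ~0.43 s per page sustained
```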
Circularity Check
No circularity: empirical system description and measured OCR performance
Full rationale
The paper presents DeepSeek-OCR as an engineering system consisting of DeepEncoder and a decoder, with all central claims resting on reported experimental measurements of decoding precision at given compression ratios and benchmark comparisons on OmniDocBench. No equations, derivations, or first-principles arguments are advanced that could reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work to justify core results.
Axiom & Free-Parameter Ledger
free parameters (1)
- vision token count per page
axioms (1)
- Domain assumption: High-resolution text images can be encoded into a compact set of vision tokens that retain sufficient information for accurate reconstruction.
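If the free parameter is read through the encoder sketch above (16x16 patches followed by a further 16x token reduction), the vision-token budget falls out of the render resolution. The factors are our assumption; the resulting budgets happen to be consistent with the 100-token and sub-800-token figures quoted in the abstract.

```python
# Hedged sketch: vision tokens per page as a function of render resolution,
# assuming 16x16 patches then a 4x-per-side reduction (our reading, not spec).
def vision_tokens(height: int, width: int, patch: int = 16, side_reduce: int = 4) -> int:
    cell = patch * side_reduce  # 64 input pixels per vision token, per side
    return (height // cell) * (width // cell)

for side in (640, 1024, 1280):
    print(f"{side}x{side} -> {vision_tokens(side, side)} vision tokens")
# 640x640 -> 100, 1024x1024 -> 256, 1280x1280 -> 400
```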
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tagged: unclear)
  The relation between this paper passage and the cited Recognition theorem is unclear.
  Paper passage: "DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%."
- IndisputableMonolith.Foundation.PhiForcing.phi_equation (tagged: unclear)
  The relation between this paper passage and the cited Recognition theorem is unclear.
  Paper passage: "DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 42 Pith papers
- How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings
  PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
- From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
  A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
- UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
  UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
- Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
  A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
- Visual Text Compression as Measure Transport
  Framing visual text compression as measure transport decomposes encoding loss into precision and coverage costs, enabling a label-free routing rule that matches oracle performance on 17 of 24 NLP datasets while using ...
- OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
  OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
- Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing
  A non-reversible hashing technique allows legal distribution of annotations for copyrighted texts by enabling alignment between user-owned copies and shared hashed data with high accuracy.
- A Minimal Model of Representation Collapse: Frustration, Stop-Gradient, and Dynamics
  A minimal embedding model shows representation collapse arises from frustrated samples through slow dynamics and is prevented by stop-gradient.
- UIPress: Bringing Optical Token Compression to UI-to-Code Generation
  UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
- CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
  CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.
- IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics
  IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on rea...
- MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
  A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
- Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
  Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
- From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering
  Docling with hierarchical splitting reaches 94.1% RAG accuracy on domain documents, beating naive PDF loading but trailing manual Markdown curation at 97.1%.
- DocAtlas: Multilingual Document Understanding Across 80+ Languages
  DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.
- Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment
  Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.
- ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
  ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.
- Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation
  CoE applies vision-language models directly to document screenshots to deliver pixel-level bounding-box attribution for evidence in iterative retrieval-augmented generation, outperforming text baselines on visual-layo...
- RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
  RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
- SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents
  SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.
- Can MLLMs "Read" What is Missing?
  MMTR-Bench shows that current MLLMs face significant difficulty reconstructing masked text from visual context, especially at sentence and paragraph lengths.
- HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
  HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
- POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
  POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
- On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
  Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
  Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
- Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
  G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
- InstructTable: Improving Table Structure Recognition Through Instructions
  InstructTable combines instruction-guided pre-training on structural patterns with visual fine-tuning and a template-free synthetic data generator (TME) to reach state-of-the-art table structure recognition on public ...
- Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing
  A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.
- Token-Efficient Multimodal Reasoning via Image Prompt Packaging
  IPPg embeds text into images to reduce multimodal model inference costs by 35.8-91% with competitive accuracy on many VQA and code benchmarks.
- Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
  PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
- Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
  A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.
- Logics-Parsing-Omni Technical Report
  Omni Parsing framework converts complex multimodal signals into locatable, enumerable, and traceable structured knowledge via hierarchical detection, recognition, and interpreting with strict evidence alignment.
- A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval
  Single-vector aggregation in visual financial document retrieval collapses semantically distinct documents due to global texture dominance, as demonstrated by a new diagnostic benchmark where patch-level signals detec...
- GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
  GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass inference with modular flexibility.
- GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
  GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.
- LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
  LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.
- Sema: Semantic Transport for Real-Time Multimodal Agents
  Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.
- MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
  MemCoT redefines long-context reasoning as iterative stateful search with zoom-in/zoom-out memory perception and dual short-term memories, claiming SOTA results on LoCoMo and LongMemEval-S benchmarks.
- Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
  Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
- JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
  JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
- RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
  RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.
- Memory as Metabolism: A Design for Companion Knowledge Systems
  This paper designs a companion knowledge system with TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, and AUDIT operations plus memory gravity and minority-hypothesis retention to give contradictory evidence a path to updat...