Recognition: 2 Lean theorem links
DeepSeek-OCR: Contexts Optical Compression
Pith reviewed 2026-05-12 04:30 UTC · model grok-4.3
The pith
DeepSeek-OCR renders text contexts as 2D images and compresses them into vision tokens, recovering the original text with high accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepSeek-OCR uses DeepEncoder to compress high-resolution renderings of text into a small set of vision tokens, which a decoder then converts back to text. This optical compression maintains 97% decoding precision when the number of text tokens is within 10 times the number of vision tokens (compression < 10x) and about 60% at 20x, while outperforming competing OCR systems on OmniDocBench with significantly fewer vision tokens.
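To make the claimed operating points concrete, here is a minimal sketch of the compression-ratio bookkeeping, assuming only the two numbers reported in the abstract; the linear falloff between 10x and 20x is our interpolation, not a measured curve.

```python
# Minimal bookkeeping for the two reported operating points. The linear
# falloff between 10x and 20x is our interpolation, not a paper result.
def compression_ratio(n_text_tokens: int, n_vision_tokens: int) -> float:
    return n_text_tokens / n_vision_tokens

def reported_precision(ratio: float) -> float:
    if ratio <= 10:
        return 0.97  # reported: 97% precision below 10x compression
    if ratio >= 20:
        return 0.60  # reported: ~60% precision at 20x compression
    return 0.97 - (ratio - 10) * (0.97 - 0.60) / 10  # assumed interpolation

for text_tokens in (800, 1500, 2000):
    r = compression_ratio(text_tokens, n_vision_tokens=100)
    print(f"{text_tokens} text tokens @ 100 vision tokens -> "
          f"{r:.0f}x, ~{reported_precision(r):.0%} precision")
```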
What carries the argument
The DeepEncoder, which processes high-resolution 2D image renderings of text to produce a compressed sequence of vision tokens while keeping activations low.
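As a reading aid, here is a schematic PyTorch sketch of the kind of pipeline this description implies: patchify at high resolution, mix locally while activations are still cheap, then downsample convolutionally before any global processing. The layer choices and the 16x token-reduction factor are illustrative assumptions, not the paper's specification.

```python
# Schematic sketch (not the paper's architecture) of high-resolution
# patchification followed by convolutional token reduction.
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # image -> patch tokens
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1)    # cheap local mixing (stand-in for windowed attention)
        self.reduce = nn.Conv2d(dim, dim, kernel_size=4, stride=4)    # 16x token reduction (4x per side)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.patchify(img)               # (B, D, H/16, W/16)
        x = torch.relu(self.local(x))
        x = self.reduce(x)                   # (B, D, H/64, W/64)
        return x.flatten(2).transpose(1, 2)  # (B, N_vision_tokens, D)

tokens = TokenCompressor()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 256, 768]): 4096 patches compressed to 256 vision tokens
```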
If this is right
- Long contexts in LLMs can be stored more efficiently using optical representations (a minimal rendering sketch follows this list).
- OCR tasks on documents can be performed with reduced token counts compared to traditional methods.
- Large-scale training data for vision-language models can be generated rapidly at rates exceeding 200k pages per day on a single GPU.
- Potential applications in preserving and processing historical documents with memory-efficient methods.
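These implications all rest on the same optical front-end: rendering text into a fixed-size image so that downstream token cost scales with the page, not the text length. A minimal sketch, with font, page size, and wrapping as arbitrary choices of ours rather than the paper's rendering recipe:

```python
# Minimal optical front-end sketch: one fixed-size page image regardless of
# how many text tokens it holds. Rendering choices are ours, not the paper's.
from PIL import Image, ImageDraw
import textwrap

def render_page(text: str, size=(1024, 1024), margin=32) -> Image.Image:
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    wrapped = textwrap.fill(text, width=90)  # assumed line width
    draw.multiline_text((margin, margin), wrapped, fill="black")
    return img

page = render_page("long context " * 500)
page.save("context_page.png")  # token cost is now set by the image resolution
```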
Where Pith is reading between the lines
- Similar optical mapping might extend to compressing other structured data like code or tables beyond plain text.
- Investigating how the compression affects semantic understanding in downstream LLM tasks beyond raw OCR recovery.
- Testing the approach on multilingual or handwritten documents to broaden its utility.
Load-bearing premise
That converting text to high-resolution 2D images and passing them through the DeepEncoder retains enough information about content and structure for the decoder to reconstruct the text accurately at the stated compression levels.
What would settle it
Applying the model to a diverse collection of complex documents, such as those with tables, figures, and varying layouts, and measuring whether OCR accuracy drops significantly below 97% at compression ratios under 10x.
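A hedged sketch of that settling experiment: sweep compression ratios over complex documents and compare measured precision against the claimed curve. Here `ocr_model` and `load_documents` are hypothetical stand-ins, not APIs from the released code, and the similarity metric is our choice.

```python
# Hedged sketch of the settling experiment. `ocr_model` and `load_documents`
# are hypothetical stand-ins, not APIs from the released DeepSeek-OCR code.
from difflib import SequenceMatcher

def precision(decoded: str, reference: str) -> float:
    # Crude character-level similarity as a stand-in metric (our choice,
    # since the paper does not define 'decoding (OCR) precision').
    return SequenceMatcher(None, decoded, reference).ratio()

def sweep(ocr_model, load_documents, ratios=(5, 10, 15, 20)):
    results = {}
    for ratio in ratios:
        scores = []
        for doc in load_documents():  # complex layouts: tables, figures, multi-column
            budget = max(1, len(doc.text_tokens) // ratio)  # vision-token budget at this ratio
            decoded = ocr_model.decode(doc.image, n_vision_tokens=budget)
            scores.append(precision(decoded, doc.text))
        results[ratio] = sum(scores) / len(scores)
    # The claim is in trouble if results[r] falls well below 0.97 for r < 10.
    return results
```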
Original abstract
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeepSeek-OCR for optical compression of long text contexts: text is rendered as high-resolution 2D images, encoded by DeepEncoder into a small number of vision tokens while keeping activations low, and decoded by DeepSeek3B-MoE-A570M to recover the original content. It reports 97% decoding (OCR) precision when text tokens are within 10x the vision tokens (<10x compression) and ~60% at 20x compression, plus superior results on OmniDocBench versus GOT-OCR2.0 (using 100 vs 256 tokens/page) and MinerU2.0 (using <800 vs 6000+ tokens/page), with production throughput of 200k+ pages/day on one A100-40G. Code and weights are released publicly.
Significance. If the reported precision and benchmark results hold under rigorous evaluation, the work could meaningfully advance long-context modeling by demonstrating a viable optical 2D compression route that drastically reduces token counts while retaining semantic and layout fidelity. The high-throughput data-generation capability and public release of code/weights are concrete strengths that would support follow-on research on memory mechanisms and historical document processing.
major comments (3)
- [Abstract] Abstract: The central claims of 97% OCR precision at compression ratios <10x and 60% at 20x, together with the OmniDocBench outperformance figures, are stated without any accompanying experimental details, dataset descriptions, error bars, controls, or per-category breakdowns. This absence makes it impossible to verify whether the encoder preserves layout cues for tables, formulas, multi-column text, or non-standard fonts outside the training distribution.
- [DeepEncoder] DeepEncoder description (throughout): The architecture is characterized only at a high level as maintaining low activations under high-resolution input and achieving high compression; no equations, layer counts, token-reduction mechanism, or ablation on information loss for semantic/layout elements are supplied, leaving the core assumption that rasterized text yields reconstructible vision tokens untestable.
- [Experiments / OmniDocBench] OmniDocBench and production claims: The token-count comparisons (100 vision tokens vs. GOT-OCR2.0 at 256; <800 vs. MinerU2.0 at 6000+) and the 200k+ pages/day throughput are presented without specifying input resolution, exact benchmark protocol, model-size controls, or how 'decoding (OCR) precision' is computed, rendering the practical-value assertions unverifiable.
minor comments (2)
- The phrase 'decoding (OCR) precision' is introduced without a formal definition, formula, or reference to standard OCR metrics (e.g., CER, WER); one candidate definition is sketched after this list.
- No architecture diagram or training-recipe summary is provided despite the two-component system being central to the contribution.
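On the first minor comment: one standard candidate the referee alludes to is character error rate (CER) from Levenshtein distance. A self-contained sketch, offered as our assumption about a plausible metric rather than the paper's definition:

```python
# CER via Levenshtein distance: one standard OCR metric the paper could mean.
# This is our assumption about the metric, not the paper's definition.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    return levenshtein(hypothesis, reference) / max(1, len(reference))

print(cer("DeepSeek-0CR", "DeepSeek-OCR"))  # 1 substitution / 12 chars ~ 0.083
```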
Simulated Author's Rebuttal
We sincerely thank the referee for the detailed and constructive comments. We agree that the manuscript would benefit from additional experimental details, architectural specifications, and benchmark protocols to enhance verifiability. We will undertake a major revision to address these points by expanding the abstract, adding a technical description of DeepEncoder with equations and ablations, and providing full benchmark protocols, input resolutions, metric definitions, and per-category breakdowns in the Experiments section.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claims of 97% OCR precision at compression ratios <10x and 60% at 20x, together with the OmniDocBench outperformance figures, are stated without any accompanying experimental details, dataset descriptions, error bars, controls, or per-category breakdowns. This absence makes it impossible to verify whether the encoder preserves layout cues for tables, formulas, multi-column text, or non-standard fonts outside the training distribution.
  Authors: We agree that the abstract is concise and omits supporting details. In the revised manuscript, we will expand the abstract to briefly reference the evaluation datasets (including OmniDocBench), the definition of decoding (OCR) precision, and note that full experimental protocols, error bars from repeated evaluations, and per-category breakdowns (covering tables, formulas, multi-column layouts, and varied fonts) are provided in the Experiments section. This will improve verifiability without exceeding abstract length constraints. Revision: yes.
- Referee: [DeepEncoder] DeepEncoder description (throughout): The architecture is characterized only at a high level as maintaining low activations under high-resolution input and achieving high compression; no equations, layer counts, token-reduction mechanism, or ablation on information loss for semantic/layout elements are supplied, leaving the core assumption that rasterized text yields reconstructible vision tokens untestable.
  Authors: The current description focuses on the high-level design to emphasize the optical compression application. In the revision, we will add a dedicated subsection detailing the DeepEncoder architecture, including layer counts, the token-reduction mechanism (via efficient downsampling and attention), relevant equations for activation control and compression, and ablation studies quantifying information retention for semantic content and layout elements such as tables, formulas, and multi-column text. The publicly released code will be cross-referenced for exact implementation. Revision: yes.
- Referee: [Experiments / OmniDocBench] OmniDocBench and production claims: The token-count comparisons (100 vision tokens vs. GOT-OCR2.0 at 256; <800 vs. MinerU2.0 at 6000+) and the 200k+ pages/day throughput are presented without specifying input resolution, exact benchmark protocol, model-size controls, or how 'decoding (OCR) precision' is computed, rendering the practical-value assertions unverifiable.
  Authors: We acknowledge these omissions limit verifiability. In the revised Experiments section, we will specify input image resolutions for text rendering, the exact OmniDocBench evaluation protocol (including page processing and comparison methodology), model-size controls, and the precise computation of decoding (OCR) precision (e.g., character- or token-level exact match). We will also detail the throughput measurement (hardware configuration, batch sizes, and scaling on A100-40G) and add per-category results to demonstrate robustness on tables, formulas, and non-standard fonts. Revision: yes.
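A back-of-envelope check, ours rather than the authors', of what the throughput claim implies:

```python
# What 200k+ pages/day on a single A100-40G implies about sustained per-page
# latency (batching folded in); numbers from the abstract, arithmetic ours.
pages_per_day = 200_000
seconds_per_day = 24 * 60 * 60  # 86,400
print(f"{seconds_per_day / pages_per_day:.2f} s/page")  # ~0.43 s per page sustained
```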
Circularity Check
No circularity: empirical system description and measured OCR performance
Full rationale
The paper presents DeepSeek-OCR as an engineering system consisting of DeepEncoder and a decoder, with all central claims resting on reported experimental measurements of decoding precision at given compression ratios and benchmark comparisons on OmniDocBench. No equations, derivations, or first-principles arguments are advanced that could reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work to justify core results.
Axiom & Free-Parameter Ledger
free parameters (1)
- vision token count per page
axioms (1)
- Domain assumption: High-resolution text images can be encoded into a compact set of vision tokens that retain sufficient information for accurate reconstruction.
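If the free parameter is read through the encoder sketch above (16x16 patches followed by a further 16x token reduction), the vision-token budget falls out of the render resolution. The factors are our assumption; the resulting budgets happen to be consistent with the 100-token and sub-800-token figures quoted in the abstract.

```python
# Hedged sketch: vision tokens per page as a function of render resolution,
# assuming 16x16 patches then a 4x-per-side reduction (our reading, not spec).
def vision_tokens(height: int, width: int, patch: int = 16, side_reduce: int = 4) -> int:
    cell = patch * side_reduce  # 64 input pixels per vision token, per side
    return (height // cell) * (width // cell)

for side in (640, 1024, 1280):
    print(f"{side}x{side} -> {vision_tokens(side, side)} vision tokens")
# 640x640 -> 100, 1024x1024 -> 256, 1280x1280 -> 400
```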
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tagged: unclear)
  The relation between this paper passage and the cited Recognition theorem is unclear.
  Paper passage: "DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%."
- IndisputableMonolith.Foundation.PhiForcing.phi_equation (tagged: unclear)
  The relation between this paper passage and the cited Recognition theorem is unclear.
  Paper passage: "DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 42 Pith papers
- How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings
  PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
- From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
  A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
- UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
  UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
- Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
  A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
- Visual Text Compression as Measure Transport
  Framing visual text compression as measure transport decomposes encoding loss into precision and coverage costs, enabling a label-free routing rule that matches oracle performance on 17 of 24 NLP datasets while using ...
- OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
  OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
- Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing
  A non-reversible hashing technique allows legal distribution of annotations for copyrighted texts by enabling alignment between user-owned copies and shared hashed data with high accuracy.
- A Minimal Model of Representation Collapse: Frustration, Stop-Gradient, and Dynamics
  A minimal embedding model shows representation collapse arises from frustrated samples through slow dynamics and is prevented by stop-gradient.
- UIPress: Bringing Optical Token Compression to UI-to-Code Generation
  UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
- CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
  CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.
- IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics
  IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on rea...
- MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
  A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
- Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
  Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
- From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering
  Docling with hierarchical splitting reaches 94.1% RAG accuracy on domain documents, beating naive PDF loading but trailing manual Markdown curation at 97.1%.
- DocAtlas: Multilingual Document Understanding Across 80+ Languages
  DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.
- Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment
  Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.
- ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
  ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.
- Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation
  CoE applies vision-language models directly to document screenshots to deliver pixel-level bounding-box attribution for evidence in iterative retrieval-augmented generation, outperforming text baselines on visual-layo...
- RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
  RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
- SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents
  SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.
- Can MLLMs "Read" What is Missing?
  MMTR-Bench shows that current MLLMs face significant difficulty reconstructing masked text from visual context, especially at sentence and paragraph lengths.
- HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
  HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
- POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
  POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
- On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
  Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
  Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
- Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
  G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
- InstructTable: Improving Table Structure Recognition Through Instructions
  InstructTable combines instruction-guided pre-training on structural patterns with visual fine-tuning and a template-free synthetic data generator (TME) to reach state-of-the-art table structure recognition on public ...
- Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing
  A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.
- Token-Efficient Multimodal Reasoning via Image Prompt Packaging
  IPPg embeds text into images to reduce multimodal model inference costs by 35.8-91% with competitive accuracy on many VQA and code benchmarks.
- Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
  PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
- Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
  A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.
- Logics-Parsing-Omni Technical Report
  Omni Parsing framework converts complex multimodal signals into locatable, enumerable, and traceable structured knowledge via hierarchical detection, recognition, and interpreting with strict evidence alignment.
- A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval
  Single-vector aggregation in visual financial document retrieval collapses semantically distinct documents due to global texture dominance, as demonstrated by a new diagnostic benchmark where patch-level signals detec...
- GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
  GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass inference with modular flexibility.
- GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
  GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.
- LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
  LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.
- Sema: Semantic Transport for Real-Time Multimodal Agents
  Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.
- MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
  MemCoT redefines long-context reasoning as iterative stateful search with zoom-in/zoom-out memory perception and dual short-term memories, claiming SOTA results on LoCoMo and LongMemEval-S benchmarks.
- Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
  Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
- JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
  JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
- RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
  RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.
- Memory as Metabolism: A Design for Companion Knowledge Systems
  This paper designs a companion knowledge system with TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, and AUDIT operations plus memory gravity and minority-hypothesis retention to give contradictory evidence a path to updat...