DocAtlas: Multilingual Document Understanding Across 80+ Languages
Pith reviewed 2026-05-14 20:58 UTC · model grok-4.3
The pith
Direct preference optimization with rendering-derived ground truth improves multilingual document understanding across 82 languages without base-language degradation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DocAtlas constructs unified DocTag annotations for layout, text, and component types by differentially rendering native DOCX files and synthetically generating LaTeX-based right-to-left documents. When these rendering-derived labels serve as the positive signal in direct preference optimization, the adapted models improve in-domain accuracy by 1.9 percent and out-of-domain accuracy by 1.8 percent with no measurable base-language degradation, whereas supervised fine-tuning causes out-of-domain drops of up to 21 percent.
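The unified DocTag format at the center of this claim can be pictured with a small serializer. This is an illustrative sketch only: the tag names and the coordinate-token encoding below are assumptions for exposition, not the exact DocAtlas DocTag grammar.

```python
# Toy serializer for a DocTag-style annotation (layout + text + component
# type in one string). Tag names and the <loc_*> coordinate encoding are
# illustrative assumptions, not the exact DocAtlas DocTag grammar.
def to_doctag(elements):
    parts = ["<doctag>"]
    for kind, (x0, y0, x1, y1), text in elements:
        loc = f"<loc_{x0}><loc_{y0}><loc_{x1}><loc_{y1}>"
        parts.append(f"<{kind}>{loc}{text}</{kind}>")
    parts.append("</doctag>")
    return "".join(parts)

# A two-element page: a title block and a body-text block, with
# hypothetical pixel coordinates.
page = [
    ("title", (12, 8, 488, 40), "DocAtlas"),
    ("text", (12, 56, 488, 120), "Multilingual document understanding"),
]
serialized = to_doctag(page)
```

The point of the format, on this reading, is that one string jointly encodes component type (tag), layout (location tokens), and text, so a single sequence model can be supervised on all three at once.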
What carries the argument
Dual rendering pipelines that generate precise DocTag structural annotations from native and synthetic documents, used as positive signals in direct preference optimization for stable multilingual adaptation.
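The preference-optimization step uses the standard DPO objective (Rafailov et al.); a minimal per-pair sketch, with hypothetical log-probabilities standing in for the model's scores of a rendering-derived positive transcript versus a rejected model-generated one:

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*     : policy log-probability of the chosen/rejected transcript
    ref_logp_* : frozen reference-model log-probabilities
    """
    # Implicit reward margin: how much the policy prefers the positive
    # (rendering-derived DocTag transcript) over the negative, relative
    # to the frozen reference model.
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    # -log(sigmoid(beta * margin))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical log-probabilities (not from the paper): the positive is the
# rendering-derived ground truth, the negative a transcript with errors.
loss = dpo_loss(logp_pos=-12.0, logp_neg=-15.0,
                ref_logp_pos=-13.0, ref_logp_neg=-14.0, beta=0.1)
```

Because the reference model anchors the update, large drifts away from the base model are penalized, which is one plausible mechanism for the reported absence of base-language degradation relative to supervised fine-tuning.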
If this is right
- Preference optimization with rendering labels can replace supervised fine-tuning for multilingual document tasks to avoid out-of-domain degradation.
- High-quality training data for OCR and layout analysis becomes available for languages that previously lacked it.
- Persistent performance gaps remain across low-resource scripts, indicating where further data or architecture work is still needed.
- The DocAtlas-DeepSeek model sets a new reference point 1.7 percent above prior state-of-the-art systems.
Where Pith is reading between the lines
- The same rendering-as-supervision approach could be applied to other structured visual tasks such as table extraction or form understanding.
- If rendering pipelines scale to additional file formats, the method might reduce reliance on model-generated labels in many multimodal domains.
- Stable adaptation without base-language loss suggests a general strategy for extending vision-language models to new scripts and layouts.
Load-bearing premise
The rendered annotations from DOCX and LaTeX pipelines accurately represent real-world document distributions across all 82 languages without introducing new biases or distribution shifts.
What would settle it
A held-out test set of native scanned documents in several low-resource languages where DocAtlas-trained models show accuracy equal to or below the supervised-fine-tuning baseline.
Original abstract
Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, where supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DocAtlas, a framework for high-fidelity multilingual document understanding across 82 languages and 9 tasks. It constructs datasets via dual annotation pipelines—differential rendering of native DOCX documents and synthetic LaTeX generation for RTL scripts—producing model-free DocTag annotations for layout, text, and components. Evaluation of 16 SOTA models reveals gaps in low-resource scripts; DPO using rendering-derived ground truth as the positive signal yields +1.9% in-domain and +1.8% out-of-domain accuracy gains with no base-language degradation, while SFT degrades OOD performance by up to 21%. The best variant (DocAtlas-DeepSeek) improves +1.7% over the strongest baseline.
Significance. If the annotation quality and empirical results hold, the work supplies large-scale multilingual resources and benchmarks that address data scarcity in low-resource languages. The demonstration that DPO enables stable adaptation without the OOD degradation seen in SFT offers a practical method for multilingual document models, with potential impact on OCR, layout analysis, and related tasks.
major comments (2)
- Abstract and data-construction sections: The headline DPO gains (+1.9% in-domain, +1.8% OOD, no base-language drop) rest on the assumption that differential rendering and LaTeX synthetic annotations produce precise, distributionally faithful DocTag ground truth across all 82 languages. No per-language or per-script quantitative validation (layout IoU, text-extraction F1, or human agreement rates) is reported, leaving open the possibility that the positive signal encodes pipeline artifacts rather than genuine structure.
- Evaluation section: Reported accuracy deltas lack error bars, statistical significance tests, dataset statistics, or ablation studies on the DPO preference pairs, rendering pipeline, or language coverage. Without these, it is difficult to confirm the robustness of the claim that DPO is stable while SFT degrades OOD by up to 21%.
minor comments (2)
- Define 'in-domain' versus 'out-of-domain' splits explicitly, including how languages and tasks are partitioned.
- Clarify the exact composition of the 9 evaluation tasks and their coverage across the 82 languages and scripts.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on validation and statistical rigor. We address each major comment below and will incorporate the suggested additions in the revised manuscript to strengthen the claims.
Point-by-point responses
- Referee: Abstract and data-construction sections: The headline DPO gains (+1.9% in-domain, +1.8% OOD, no base-language drop) rest on the assumption that differential rendering and LaTeX synthetic annotations produce precise, distributionally faithful DocTag ground truth across all 82 languages. No per-language or per-script quantitative validation (layout IoU, text-extraction F1, or human agreement rates) is reported, leaving open the possibility that the positive signal encodes pipeline artifacts rather than genuine structure.
Authors: We acknowledge that explicit per-language quantitative validation metrics are not reported in the current version. The differential rendering pipeline is model-free and extracts annotations directly from the native DOCX internal structure, while LaTeX generation for RTL scripts uses controlled, deterministic templates; both are designed to be distributionally faithful by construction. To address the concern directly, we will add a new subsection with human agreement rates and quantitative metrics (layout IoU, text F1) on a sampled subset of languages spanning major scripts. This will provide empirical support for annotation precision and help rule out pipeline artifacts. Revision: yes.
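The layout IoU the authors commit to reporting is a standard box-overlap metric; a minimal sketch for axis-aligned boxes follows (the coordinates in the usage line are hypothetical):

```python
def bbox_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    # Intersection rectangle (empty if the boxes are disjoint).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical rendered vs. human-marked layout boxes for one block.
agreement = bbox_iou((0, 0, 2, 2), (1, 1, 3, 3))  # partial overlap
```

Averaging this over matched layout blocks, per language and per script, would give exactly the kind of table the referee asks for.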
- Referee: Evaluation section: Reported accuracy deltas lack error bars, statistical significance tests, dataset statistics, or ablation studies on the DPO preference pairs, rendering pipeline, or language coverage. Without these, it is difficult to confirm the robustness of the claim that DPO is stable while SFT degrades OOD by up to 21%.
Authors: We agree that error bars, significance testing, dataset statistics, and ablations are needed for a full robustness assessment. In the revision we will add error bars to all accuracy figures (computed over multiple seeds), report paired statistical significance tests on the reported deltas, include a table of per-language-group dataset statistics, and provide ablation studies on DPO preference-pair construction, key rendering-pipeline components, and language-coverage subsets. These changes will directly support the stability claims for DPO versus SFT. Revision: yes.
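One distribution-free way to realize the promised paired significance tests is a sign-flip permutation test on per-seed accuracy deltas; a sketch (the deltas below are hypothetical, not the paper's numbers):

```python
import random

def paired_permutation_pvalue(deltas, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on per-seed accuracy deltas.

    deltas: list of (DPO accuracy - baseline accuracy), one per seed/run.
    Returns an estimated p-value for the null hypothesis 'mean delta == 0'.
    """
    rng = random.Random(seed)
    observed = abs(sum(deltas) / len(deltas))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired delta is symmetric about zero,
        # so its sign can be flipped at random.
        flipped = [d if rng.random() < 0.5 else -d for d in deltas]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the Monte Carlo estimate away from zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-seed OOD deltas: consistent gains yield a small p-value.
p = paired_permutation_pvalue([1.9, 2.1, 1.8, 2.0, 1.7])
```

With only a handful of seeds the attainable p-value is bounded below (2 of the 2^n sign patterns match an all-one-sign result), which is itself a useful caveat to attach to the revised tables.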
Circularity Check
No significant circularity in derivation chain
full rationale
The paper constructs DocAtlas datasets via differential rendering of native DOCX documents and synthetic LaTeX generation for RTL scripts, then reports empirical results from evaluating 16 models and applying DPO with rendering-derived annotations as positive signals. The claimed gains (+1.9% in-domain, +1.8% out-of-domain) and comparison to SFT degradation are measured on held-out evaluation tasks across 82 languages; no equations, fitted parameters, or self-citations reduce any result to its own inputs by construction. Dataset generation is a fixed pipeline choice whose fidelity is asserted but not derived from the performance numbers themselves, leaving the central claims externally falsifiable via the reported benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Rendered native DOCX and synthetic LaTeX documents produce precise structural annotations that match real document distributions across 82 languages.