DocAtlas: Multilingual Document Understanding Across 80+ Languages
Pith reviewed 2026-05-14 20:58 UTC · model grok-4.3
The pith
Direct preference optimization with rendering-derived ground truth improves multilingual document understanding across 82 languages without base-language degradation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DocAtlas constructs unified DocTag annotations for layout, text, and component types by differentially rendering native DOCX files and synthetically generating LaTeX-based right-to-left documents. When these rendering-derived labels serve as the positive signal in direct preference optimization, the adapted models improve in-domain accuracy by 1.9 percent and out-of-domain accuracy by 1.8 percent with no measurable base-language degradation, whereas supervised fine-tuning causes out-of-domain drops of up to 21 percent.
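The unified DocTag format at the center of this claim can be pictured with a small serializer. This is an illustrative sketch only: the tag names and the coordinate-token encoding below are assumptions for exposition, not the exact DocAtlas DocTag grammar.

```python
# Toy serializer for a DocTag-style annotation (layout + text + component
# type in one string). Tag names and the <loc_*> coordinate encoding are
# illustrative assumptions, not the exact DocAtlas DocTag grammar.
def to_doctag(elements):
    parts = ["<doctag>"]
    for kind, (x0, y0, x1, y1), text in elements:
        loc = f"<loc_{x0}><loc_{y0}><loc_{x1}><loc_{y1}>"
        parts.append(f"<{kind}>{loc}{text}</{kind}>")
    parts.append("</doctag>")
    return "".join(parts)

# A two-element page: a title block and a body-text block, with
# hypothetical pixel coordinates.
page = [
    ("title", (12, 8, 488, 40), "DocAtlas"),
    ("text", (12, 56, 488, 120), "Multilingual document understanding"),
]
serialized = to_doctag(page)
```

The point of the format, on this reading, is that one string jointly encodes component type (tag), layout (location tokens), and text, so a single sequence model can be supervised on all three at once.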
What carries the argument
Dual rendering pipelines that generate precise DocTag structural annotations from native and synthetic documents, used as positive signals in direct preference optimization for stable multilingual adaptation.
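The preference-optimization step uses the standard DPO objective (Rafailov et al.); a minimal per-pair sketch, with hypothetical log-probabilities standing in for the model's scores of a rendering-derived positive transcript versus a rejected model-generated one:

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*     : policy log-probability of the chosen/rejected transcript
    ref_logp_* : frozen reference-model log-probabilities
    """
    # Implicit reward margin: how much the policy prefers the positive
    # (rendering-derived DocTag transcript) over the negative, relative
    # to the frozen reference model.
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    # -log(sigmoid(beta * margin))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical log-probabilities (not from the paper): the positive is the
# rendering-derived ground truth, the negative a transcript with errors.
loss = dpo_loss(logp_pos=-12.0, logp_neg=-15.0,
                ref_logp_pos=-13.0, ref_logp_neg=-14.0, beta=0.1)
```

Because the reference model anchors the update, large drifts away from the base model are penalized, which is one plausible mechanism for the reported absence of base-language degradation relative to supervised fine-tuning.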
If this is right
- Preference optimization with rendering labels can replace supervised fine-tuning for multilingual document tasks to avoid out-of-domain degradation.
- High-quality training data for OCR and layout analysis becomes available for languages that previously lacked it.
- Persistent performance gaps remain across low-resource scripts, indicating where further data or architecture work is still needed.
- The DocAtlas-DeepSeek model sets a new reference point 1.7 percent above prior state-of-the-art systems.
Where Pith is reading between the lines
- The same rendering-as-supervision approach could be applied to other structured visual tasks such as table extraction or form understanding.
- If rendering pipelines scale to additional file formats, the method might reduce reliance on model-generated labels in many multimodal domains.
- Stable adaptation without base-language loss suggests a general strategy for extending vision-language models to new scripts and layouts.
Load-bearing premise
The rendered annotations from DOCX and LaTeX pipelines accurately represent real-world document distributions across all 82 languages without introducing new biases or distribution shifts.
What would settle it
A held-out test set of native scanned documents in several low-resource languages where DocAtlas-trained models show accuracy equal to or below the supervised-fine-tuning baseline.
Original abstract
Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, where supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DocAtlas, a framework for high-fidelity multilingual document understanding across 82 languages and 9 tasks. It constructs datasets via dual annotation pipelines—differential rendering of native DOCX documents and synthetic LaTeX generation for RTL scripts—producing model-free DocTag annotations for layout, text, and components. Evaluation of 16 SOTA models reveals gaps in low-resource scripts; DPO using rendering-derived ground truth as the positive signal yields +1.9% in-domain and +1.8% out-of-domain accuracy gains with no base-language degradation, while SFT degrades OOD performance by up to 21%. The best variant (DocAtlas-DeepSeek) improves +1.7% over the strongest baseline.
Significance. If the annotation quality and empirical results hold, the work supplies large-scale multilingual resources and benchmarks that address data scarcity in low-resource languages. The demonstration that DPO enables stable adaptation without the OOD degradation seen in SFT offers a practical method for multilingual document models, with potential impact on OCR, layout analysis, and related tasks.
major comments (2)
- Abstract and data-construction sections: The headline DPO gains (+1.9% in-domain, +1.8% OOD, no base-language drop) rest on the assumption that differential rendering and LaTeX synthetic annotations produce precise, distributionally faithful DocTag ground truth across all 82 languages. No per-language or per-script quantitative validation (layout IoU, text-extraction F1, or human agreement rates) is reported, leaving open the possibility that the positive signal encodes pipeline artifacts rather than genuine structure.
- Evaluation section: Reported accuracy deltas lack error bars, statistical significance tests, dataset statistics, or ablation studies on the DPO preference pairs, rendering pipeline, or language coverage. Without these, it is difficult to confirm the robustness of the claim that DPO is stable while SFT degrades OOD by up to 21%.
minor comments (2)
- Define 'in-domain' versus 'out-of-domain' splits explicitly, including how languages and tasks are partitioned.
- Clarify the exact composition of the 9 evaluation tasks and their coverage across the 82 languages and scripts.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on validation and statistical rigor. We address each major comment below and will incorporate the suggested additions in the revised manuscript to strengthen the claims.
Point-by-point responses
- Referee: Abstract and data-construction sections: The headline DPO gains (+1.9% in-domain, +1.8% OOD, no base-language drop) rest on the assumption that differential rendering and LaTeX synthetic annotations produce precise, distributionally faithful DocTag ground truth across all 82 languages. No per-language or per-script quantitative validation (layout IoU, text-extraction F1, or human agreement rates) is reported, leaving open the possibility that the positive signal encodes pipeline artifacts rather than genuine structure.
Authors: We acknowledge that explicit per-language quantitative validation metrics are not reported in the current version. The differential rendering pipeline is model-free and extracts annotations directly from the native DOCX internal structure, while LaTeX generation for RTL scripts uses controlled, deterministic templates; both are designed to be distributionally faithful by construction. To address the concern directly, we will add a new subsection with human agreement rates and quantitative metrics (layout IoU, text F1) on a sampled subset of languages spanning major scripts. This will provide empirical support for annotation precision and help rule out pipeline artifacts. Revision: yes.
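The layout IoU the authors commit to reporting is a standard box-overlap metric; a minimal sketch for axis-aligned boxes follows (the coordinates in the usage line are hypothetical):

```python
def bbox_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    # Intersection rectangle (empty if the boxes are disjoint).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical rendered vs. human-marked layout boxes for one block.
agreement = bbox_iou((0, 0, 2, 2), (1, 1, 3, 3))  # partial overlap
```

Averaging this over matched layout blocks, per language and per script, would give exactly the kind of table the referee asks for.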
- Referee: Evaluation section: Reported accuracy deltas lack error bars, statistical significance tests, dataset statistics, or ablation studies on the DPO preference pairs, rendering pipeline, or language coverage. Without these, it is difficult to confirm the robustness of the claim that DPO is stable while SFT degrades OOD by up to 21%.
Authors: We agree that error bars, significance testing, dataset statistics, and ablations are needed for a full robustness assessment. In the revision we will add error bars to all accuracy figures (computed over multiple seeds), report paired statistical significance tests on the reported deltas, include a table of per-language-group dataset statistics, and provide ablation studies on DPO preference-pair construction, key rendering-pipeline components, and language-coverage subsets. These changes will directly support the stability claims for DPO versus SFT. Revision: yes.
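One distribution-free way to realize the promised paired significance tests is a sign-flip permutation test on per-seed accuracy deltas; a sketch (the deltas below are hypothetical, not the paper's numbers):

```python
import random

def paired_permutation_pvalue(deltas, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on per-seed accuracy deltas.

    deltas: list of (DPO accuracy - baseline accuracy), one per seed/run.
    Returns an estimated p-value for the null hypothesis 'mean delta == 0'.
    """
    rng = random.Random(seed)
    observed = abs(sum(deltas) / len(deltas))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired delta is symmetric about zero,
        # so its sign can be flipped at random.
        flipped = [d if rng.random() < 0.5 else -d for d in deltas]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the Monte Carlo estimate away from zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-seed OOD deltas: consistent gains yield a small p-value.
p = paired_permutation_pvalue([1.9, 2.1, 1.8, 2.0, 1.7])
```

With only a handful of seeds the attainable p-value is bounded below (2 of the 2^n sign patterns match an all-one-sign result), which is itself a useful caveat to attach to the revised tables.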
Circularity Check
No significant circularity in derivation chain
full rationale
The paper constructs DocAtlas datasets via differential rendering of native DOCX documents and synthetic LaTeX generation for RTL scripts, then reports empirical results from evaluating 16 models and applying DPO with rendering-derived annotations as positive signals. The claimed gains (+1.9% in-domain, +1.8% out-of-domain) and comparison to SFT degradation are measured on held-out evaluation tasks across 82 languages; no equations, fitted parameters, or self-citations reduce any result to its own inputs by construction. Dataset generation is a fixed pipeline choice whose fidelity is asserted but not derived from the performance numbers themselves, leaving the central claims externally falsifiable via the reported benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Rendered native DOCX and synthetic LaTeX documents produce precise structural annotations that match real document distributions across 82 languages.