METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition
Pith reviewed 2026-06-29 18:46 UTC · model grok-4.3
The pith
METATR is a multilingual evolving benchmark for evaluating automatic text recognition on diverse real-world documents in 29 languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
METATR (v1.0) introduces a dataset from various public collections covering 29 languages with multiple scripts and layouts, along with a standardized prompting and normalization methodology and a dynamic evaluation framework intended to produce reproducible results while remaining extensible over time, allowing for meaningful model comparison and selection in real-world conditions.
What carries the argument
The METATR benchmark dataset combined with its standardized prompting, normalization methodology, and dynamic evaluation framework for multilingual ATR assessment.
If this is right
- Practitioners can select ATR models based on performance for specific languages, scripts, or layouts.
- Progress in the field can be tracked as new models and document types are added to the evolving benchmark.
- Variability in model performance across different document types is quantified for better understanding of limitations.
- Both open-source and closed-source models can be compared under the same standardized conditions.
- Computational efficiency is reported alongside accuracy to inform deployment decisions.
Where Pith is reading between the lines
- Future work might integrate METATR with other benchmarks to create even broader evaluations.
- The emphasis on real-world conditions could encourage development of more robust vLLMs for handwritten and varied layout documents.
- Language-level performance reporting might help prioritize improvements in underrepresented scripts.
- Extending the benchmark to include more evolving elements like user-submitted documents could enhance its relevance.
Load-bearing premise
Documents selected from various public collections adequately represent the complexity and diversity of real-world documents across 29 languages, scripts, and layouts.
What would settle it
If adding new documents to the benchmark changes the relative rankings of models in ways that do not match real-world application outcomes, the representativeness of the selection would be questioned.
Figures
read the original abstract
Benchmarks that reflect the diversity and complexity of real-world documents are essential for accurately evaluating Automatic Text Recognition (ATR) systems, especially Vision-Large Language Models (vLLMs). Although recent models demonstrate impressive performance, they are often evaluated on datasets containing modern, printed texts mostly written in English, which limits their relevance to many practical applications. Therefore, selecting a model for a specific use case requires evaluating it on data that matches the target documents. This highlights the importance of representative benchmarks for real-world applications. In this paper, we introduce METATR (v1.0), a multilingual, evolving benchmark designed to evaluate ATR models across a wide range of documents, facilitating meaningful model comparison and selection. The benchmark was designed to maximize diversity by including documents from various public collections. These documents cover 29 languages and include texts with multiple scripts and layouts. Beyond the dataset itself, METATR defines a standardized prompting and normalization methodology and establishes a dynamic evaluation framework. This approach is intended to produce reproducible results while remaining extensible over time. We evaluated a wide range of state-of-the-art systems, including open-source models and closed-source models. Results are reported across various dimensions, including performance at the dataset and language levels, robustness to handwritten documents, and computational efficiency. Our findings show that, although proprietary models achieve the most consistent performance, substantial variability persists across scripts and layouts. Overall, METATR provides a multidimensional, practitioner-oriented framework for assessing multilingual ATR in real-world conditions and tracking progress as the field evolves.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces METATR (v1.0), a multilingual evolving benchmark for Automatic Text Recognition (ATR) that selects documents from public collections to cover 29 languages with multiple scripts and layouts, aiming to maximize diversity and real-world representativeness. It defines a standardized prompting and normalization methodology plus a dynamic evaluation framework, evaluates a range of open- and closed-source vLLM and other ATR systems, and reports results showing proprietary models achieve the most consistent performance while substantial variability remains across scripts and layouts.
Significance. If the document selection and coverage can be shown to adequately capture real-world multilingual ATR complexity, METATR would supply a practitioner-oriented, extensible framework for model comparison and selection that addresses the English-centric bias of prior benchmarks.
major comments (2)
- [Abstract] Abstract: the central claim that the benchmark 'maximizes diversity' and 'represent[s] the complexity of real-world documents' across 29 languages/scripts/layouts rests on selection from 'various public collections' but supplies no explicit sampling rules, stratification criteria, diversity metrics (e.g., script entropy, layout-type coverage, degradation distribution), or comparison against any reference corpus of real-world documents; this unverified premise directly undermines the multidimensional framework claim.
- [Evaluation] Evaluation section (implied by results reporting): the reported performance variability across scripts and layouts is presented without accompanying error analysis, per-language document counts, or robustness statistics that would allow readers to assess whether observed differences reflect genuine benchmark properties rather than sampling artifacts.
minor comments (1)
- [Abstract] The abstract states the benchmark is 'evolving' and 'dynamic' but does not specify the versioning or update mechanism that would enable reproducible tracking of progress over time.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and note planned revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the benchmark 'maximizes diversity' and 'represent[s] the complexity of real-world documents' across 29 languages/scripts/layouts rests on selection from 'various public collections' but supplies no explicit sampling rules, stratification criteria, diversity metrics (e.g., script entropy, layout-type coverage, degradation distribution), or comparison against any reference corpus of real-world documents; this unverified premise directly undermines the multidimensional framework claim.
Authors: We agree that the manuscript does not supply explicit sampling rules, stratification criteria, or quantitative diversity metrics, nor does it compare the collection against a reference corpus. Document selection was performed by drawing from multiple public collections to achieve coverage of 29 languages, varied scripts, and layouts while favoring real-world documents; however, this process was not formalized with the metrics mentioned. We will revise the abstract and add a dedicated section describing the curation approach, provide per-language document counts, and explicitly discuss the limitations of the current selection procedure with respect to verifiable diversity maximization. revision: yes
-
Referee: [Evaluation] Evaluation section (implied by results reporting): the reported performance variability across scripts and layouts is presented without accompanying error analysis, per-language document counts, or robustness statistics that would allow readers to assess whether observed differences reflect genuine benchmark properties rather than sampling artifacts.
Authors: The manuscript reports aggregate and language-level results together with some robustness observations for handwritten documents, but we concur that the absence of per-language document counts, detailed error analysis, and additional robustness statistics limits the ability to interpret the reported variability. We will expand the evaluation section to include a table of document counts per language, basic error breakdowns for representative scripts and layouts, and further statistical summaries of robustness. revision: yes
Circularity Check
No circularity: benchmark paper contains no derivations or fitted predictions.
full rationale
METATR introduces a dataset and evaluation protocol with no equations, parameter fits, or predictions. The text asserts diversity maximization via selection from public collections but supplies no quantitative derivation, sampling formula, or self-referential reduction that could qualify as circular under the enumerated patterns. No self-citation chains, ansatzes, or uniqueness theorems appear in the provided sections. The central claim rests on qualitative dataset construction rather than any input-to-output equivalence by construction, making a score of 0 the appropriate finding.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Anthropic: Claude Opus 4.5 (2025), https://www.anthropic.com/news/ claude-opus-4-5 [Accessed: 2025-12]
2025
-
[2]
Bai, S., Cai, Y., Chen, R., et al.: Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
https://doi.org/10.5281/zenodo.10255840, https://doi.org/ 10.5281/zenodo.10255840
Beyer, Y., Solberg, P.E.: Norhand v3 / dataset for handwritten text recognition in norwegian (Dec 2023). https://doi.org/10.5281/zenodo.10255840, https://doi.org/ 10.5281/zenodo.10255840
-
[4]
In: Proceedings of the 5th International Work- shop on Historical Document Imaging and Processing
Boillet, M., Bonhomme, M.L., Stutzmann, D., Kermorvant, C.: Horae: an anno- tated dataset of books of hours. In: Proceedings of the 5th International Work- shop on Historical Document Imaging and Processing. p. 7–12. HIP ’19 (2019). https://doi.org/10.1145/3352631.3352633
-
[5]
In: International Conference on Pattern Recognition (2022)
Cascianelli, S., Pippi, V., Martin, M., Cornia, M., Baraldi, L., Christopher, K., Cucchiara, R.: The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition. In: International Conference on Pattern Recognition (2022)
2022
-
[6]
https://doi.org/10.5281/zenodo.6581158, https: //doi.org/10.5281/zenodo.6581158
Constum, T., Kempf, N., Paquet, T., Tranouez, P., Chatelain, C., Bree, S., Merveille, F.: POPP Datasets: Datasets for handwriting recognition from French population census (Mar 2022). https://doi.org/10.5281/zenodo.6581158, https: //doi.org/10.5281/zenodo.6581158
-
[7]
Journal of Documentation81(7), 334–354 (2025)
Crosilla, G., Klic, L., Colavizza, G.: Benchmarking large language models for handwritten text recognition. Journal of Documentation81(7), 334–354 (2025). https://doi.org/https://doi.org/10.1108/JD-03-2025-0082
-
[8]
Dolfing, H.J., Bellegarda, J., Chorowski, J., Marxer, R., Laurent, A.: The “scrib- blelens” dutch historical handwriting corpus. In: 2020 17th International Con- ference on Frontiers in Handwriting Recognition (ICFHR). pp. 67–72 (2020). https://doi.org/10.1109/ICFHR2020.2020.00023
-
[9]
GoogleDeepMind:Gemini3Pro(2025),https://deepmind.google/models/gemini/ [Accessed: 2026-01]
2025
-
[10]
In: Proceedings of the 2009 10th Inter- national Conference on Document Analysis and Recognition
Grosicki, E., Carre, M., Brodin, J.M., Geoffrois, E.: Results of the rimes evaluation campaign for handwritten mail processing. In: Proceedings of the 2009 10th Inter- national Conference on Document Analysis and Recognition. p. 941–945. ICDAR ’09 (2009). https://doi.org/10.1109/ICDAR.2009.224
-
[11]
https://doi.org/doi.org/10.23636/1135
Keinan-Schoonbaert, A.: Automatic transcription of historical handwritten arabic texts (2019). https://doi.org/doi.org/10.23636/1135
-
[12]
In: Document Analysis and Recognition - ICDAR 2021
Kodym, O., Hradiš, M.: Page layout analysis system for unconstrained historic documents. In: Document Analysis and Recognition - ICDAR 2021. pp. 492–506 (2021)
2021
-
[13]
In: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
Liu, C.L., Yin, F., Wang, D.H., Wang, Q.: Casia online and offline chinese hand- writing databases. In: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR. pp. 37 – 41 (10 2011). https://doi.org/10.1109/ ICDAR.2011.17
2011
-
[14]
Science China Information Sciences67(12) (Dec 2024)
Liu, Y., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.C., Liu, C.L., Jin, L., Bai, X.: OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences67(12) (Dec 2024). https://doi.org/10.1007/ s11432-024-4235-6 A Multilingual Benchmark for ATR 17
2024
-
[15]
International Journal on Document Analysis and Recognition5, 39–46 (11 2002)
Marti, U.V., Bunke, H.: The iam-database: An english sentence database for of- fline handwriting recognition. International Journal on Document Analysis and Recognition5, 39–46 (11 2002). https://doi.org/10.1007/s100320200071
-
[16]
com/en-us/azure/ai-services/document-intelligence/prebuilt/layout [Accessed: 2025-12]
Microsoft: Azure Document Intelligence layout (2024), https://learn.microsoft. com/en-us/azure/ai-services/document-intelligence/prebuilt/layout [Accessed: 2025-12]
2024
-
[17]
com/en-us/azure/ai-services/computer-vision/overview-ocr [Accessed: 2025-12]
Microsoft: Azure Optical Character Recognition (2024), https://learn.microsoft. com/en-us/azure/ai-services/computer-vision/overview-ocr [Accessed: 2025-12]
2024
-
[18]
Mistral AI: Mistral Large 3 (2025), https://mistral.ai/news/mistral-3 [Accessed: 2026-01]
2025
-
[19]
Mistral AI: Mistral Medium 3 (2025), https://mistral.ai/news/mistral-medium-3 [Accessed: 2026-01]
2025
-
[20]
Mistral AI: Mistral OCR (2025), https://mistral.ai/news/mistral-ocr [Accessed: 2026-01]
2025
-
[21]
Mistral AI: Mistral Small 3.1 (2025), https://mistral.ai/news/mistral-small-3-1 [Accessed: 2026-01]
2025
-
[22]
OpenAI: GPT-5.1 (2025), https://openai.com/index/gpt-5-1/ [Accessed: 2025-12]
2025
-
[23]
arXiv preprint arXiv:2510.19817 (2025)
Poznanski, J., Soldaini, L., Lo, K.: olmOCR 2: Unit Test Rewards for Document OCR. arXiv preprint arXiv:2510.19817 (2025)
-
[24]
Reducto AI: RolmOCR: A Faster, Lighter Open Source OCR Model (2025)
2025
-
[25]
https://doi.org/10.5281/zenodo.3082464, https://doi.org/10.5281/ zenodo.3082464
Romanov, M., Seydi, M.: Openiti: a machine-readable corpus of islamicate texts (May 2019). https://doi.org/10.5281/zenodo.3082464, https://doi.org/10.5281/ zenodo.3082464
-
[26]
Pattern Recognition46(6), 1658–1669 (2013)
Romero, V., Fornés, A., Serrano, N., Sánchez, J.A., Toselli, A.H., Frinken, V., Vidal, E., Lladós, J.: The esposalles database: An ancient marriage license corpus for off-line handwriting recognition. Pattern Recognition46(6), 1658–1669 (2013). https://doi.org/https://doi.org/10.1016/j.patcog.2012.11.024
-
[27]
In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Semnani, S., Zhang, H., He, X., Tekgurler, M., Lam, M.: CHURRO: Making his- tory readable with an open-weight large vision-language model for high-accuracy, low-cost historical text recognition. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 34777–34824 (Nov 2025). https://doi.org/10.18653/v1/2025.emnlp-main.1763
-
[28]
In: 2016 15th International Con- ference on Frontiers in Handwriting Recognition (ICFHR)
Sánchez, J.A., Romero, V., Toselli, A.H., Vidal, E.: Icfhr2016 competition on hand- written text recognition on the read dataset. In: 2016 15th International Con- ference on Frontiers in Handwriting Recognition (ICFHR). pp. 630–635 (2016). https://doi.org/10.1109/ICFHR.2016.0120
-
[29]
LightOnOCR: A 1b end-to-end multilingual vision-language model for state-of-the-art OCR,
Taghadouini, S., Cavaillès, A., Aubertin, B.: LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR. arXiv preprint arXiv:2601.14251 (2026)
-
[30]
DeepSeek-OCR: Contexts Optical Compression
Wei, H., Sun, Y., Li, Y.: DeepSeek-OCR: Contexts Optical Compression. arXiv preprint arXiv:2510.18234 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
DeepSeek-OCR 2: Visual causal flow,
Wei, H., Sun, Y., Li, Y.: DeepSeek-OCR 2: Visual Causal Flow. arXiv preprint arXiv:2601.20552 (2026)
-
[32]
In: Yin, X.C., Karatzas, D., Lopresti, D
Wolf, F., Tüselmann, O., Matei, A., Hennies, L., Rass, C., Fink, G.A.: CM1 - A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Lan- guage Models. In: Yin, X.C., Karatzas, D., Lopresti, D. (eds.) Document Analysis and Recognition – ICDAR 2025. pp. 23–39 (2026)
2025
-
[33]
arXiv preprint arXiv:2412.02210 (2024)
Yang, Z., Tang, J., Li, Z., Wang, P., Wan, J., Zhong, H., Liu, X., Yang, M., Wang, P., Bai, S., Jin, L., Lin, J.: CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy. arXiv preprint arXiv:2412.02210 (2024)
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.