AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models
Pith reviewed 2026-05-10 17:36 UTC · model grok-4.3
The pith
A 3-billion-parameter vision-language model fine-tuned on mixed synthetic and real Darija data becomes the first open-source OCR system for the Moroccan Arabic dialect.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AtlasOCR is the first open-source Darija OCR model, obtained by fine-tuning the Qwen2.5-VL 3B vision-language model on a dataset that combines synthetic images generated with the OCRSmith library and carefully selected real-world examples. After parameter-efficient training, the model reaches state-of-the-art results on the new AtlasOCRBench and on KITAB-Bench while remaining robust across both dialectal and standard Arabic scripts.
What carries the argument
Fine-tuning the Qwen2.5-VL 3B vision-language model, which jointly processes image and text inputs, on a Darija-specific OCR dataset, using QLoRA for quantized low-rank adaptation and Unsloth for faster training.
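To make that recipe concrete, here is a minimal sketch of the QLoRA side of the setup, assuming a recent Hugging Face transformers (with Qwen2.5-VL support), peft, and bitsandbytes. The paper itself trains through Unsloth; the rank, alpha, and dropout values below are illustrative placeholders, not the paper's ablated optimum.

```python
# Minimal QLoRA setup sketch for Qwen2.5-VL 3B (illustrative, not the paper's
# exact Unsloth pipeline). Assumes transformers >= 4.49, peft, bitsandbytes.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"

# 4-bit NF4 quantization of the frozen base weights: the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# Low-rank adapters on the attention projections; only these are trained.
# r / lora_alpha / dropout are placeholder values for illustration.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the 3B parameters
```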
If this is right
- Darija documents and images can now be converted to searchable text without relying on proprietary or foreign-language OCR engines.
- A single small model suffices for both the Moroccan dialect and standard Arabic, reducing the need to maintain separate systems.
- The AtlasOCRBench dataset supplies a public standard against which future Darija OCR work can be measured.
- Synthetic data generation can supplement scarce real examples for other low-resource language variants (a minimal rendering sketch follows this list).
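As a generic illustration of that last point, the sketch below renders short strings into lightly augmented images paired with their ground-truth transcriptions using Pillow. It is not OCRSmith's API (which this review does not show); the font path and the sample Darija strings are placeholders, and faithful Arabic shaping in practice needs Pillow built with libraqm (or arabic_reshaper plus python-bidi preprocessing).

```python
# Generic synthetic OCR data sketch: (image, transcription) pairs with mild
# augmentation. NOT OCRSmith's API; font path and texts are placeholders.
import random
from PIL import Image, ImageDraw, ImageFilter, ImageFont

FONT_PATH = "fonts/NotoNaskhArabic-Regular.ttf"               # hypothetical local font file
SAMPLE_TEXTS = ["واش كاين شي طوبيس لكازا؟", "السلام، لاباس؟"]  # placeholder Darija lines

def render_sample(text: str, size=(512, 64)) -> tuple[Image.Image, str]:
    """Render one (image, transcription) pair with mild augmentation."""
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(FONT_PATH, size=32)
    # Right-aligned anchor, since Arabic script runs right to left.
    # Correct glyph joining requires Pillow built with libraqm.
    draw.text((size[0] - 10, 10), text, font=font, fill="black", anchor="ra")
    img = img.rotate(random.uniform(-2, 2), fillcolor="white")                 # slight skew
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 0.8)))  # mild blur
    return img, text

# A tiny batch of (image, text) pairs, e.g. to be written out as a dataset.
pairs = [render_sample(random.choice(SAMPLE_TEXTS)) for _ in range(4)]
```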
Where Pith is reading between the lines
- The same data-curation and fine-tuning recipe could be applied to other Arabic dialects that currently lack OCR support.
- Deployment on modest hardware becomes feasible because the model stays at 3 billion parameters.
- Integration with mobile cameras could enable real-time translation of Darija signs and menus.
Load-bearing premise
The performance gains come from genuine learning of Darija text patterns rather than from the model merely memorizing the particular distribution of images and fonts in the training and benchmark collections.
What would settle it
A drop in accuracy below the reported state-of-the-art when the model is tested on freshly collected Darija images from sources never used in training or in AtlasOCRBench or KITAB-Bench, such as new street signs or handwritten notes.
Original abstract
Darija, the Moroccan Arabic dialect, is rich in visual content yet lacks specialized Optical Character Recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model built by fine-tuning a 3B parameter Vision Language Model (VLM). We detail our comprehensive approach, from curating a unique Darija-specific dataset leveraging both synthetic generation with our OCRSmith library and carefully sourced real-world data, to implementing efficient fine-tuning strategies. We utilize QLoRA and Unsloth for parameter-efficient training of Qwen2.5-VL 3B and present comprehensive ablation studies optimizing key hyperparameters. Our evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench demonstrates state-of-the-art performance, challenging larger models and highlighting AtlasOCR's robustness and generalization capabilities for both Darija and standard Arabic OCR tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AtlasOCR as the first open-source OCR model for Darija (Moroccan Arabic dialect) by fine-tuning the 3B-parameter Qwen2.5-VL vision-language model with QLoRA and Unsloth. It describes curating a mixed dataset of synthetic images generated via the authors' OCRSmith library plus real-world sources, presents ablation studies on hyperparameters such as LoRA rank and learning rate, and evaluates on the new AtlasOCRBench plus KITAB-Bench, claiming state-of-the-art performance and robustness for both Darija and standard Arabic OCR tasks.
Significance. If the SOTA claims hold after proper validation, the work would fill a clear gap by delivering the first open-source Darija OCR system, enabling practical applications in document processing, accessibility, and cultural preservation for a low-resource dialect. The efficient fine-tuning recipe on a compact VLM and the release of AtlasOCRBench could also serve as a reusable template for other dialect-specific visual-text tasks.
major comments (2)
- [Dataset construction] Dataset construction section: AtlasOCRBench is built using the identical OCRSmith synthetic pipeline employed for training data (plus real-world sources). No quantitative leakage audit—such as n-gram overlap statistics, embedding cosine similarity, or prompt-level separation between train and test splits—is reported. This directly weakens the generalization and robustness claims for a 3B VLM fine-tuned on dialect-specific visual text.
- [Evaluation] Evaluation section: The abstract and results assert SOTA performance that challenges larger models on AtlasOCRBench and KITAB-Bench, yet the manuscript supplies no concrete metrics, baseline tables, error bars, or dataset statistics. Without these, the central performance claim cannot be verified and the ablation studies remain ungrounded.
minor comments (2)
- [Abstract] Abstract: The claim of 'comprehensive ablation studies' is stated without even a one-sentence summary of the key hyperparameter findings or optimal configuration, reducing immediate readability.
- [Introduction] Notation: Acronyms and tool names such as VLM, QLoRA, and Unsloth are used without expansion on their first appearance, which may hinder readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.
Point-by-point responses
Referee: [Dataset construction] Dataset construction section: AtlasOCRBench is built using the identical OCRSmith synthetic pipeline employed for training data (plus real-world sources). No quantitative leakage audit—such as n-gram overlap statistics, embedding cosine similarity, or prompt-level separation between train and test splits—is reported. This directly weakens the generalization and robustness claims for a 3B VLM fine-tuned on dialect-specific visual text.
Authors: We thank the referee for raising this valid concern about data leakage. The training data primarily consists of synthetic images generated via OCRSmith, while AtlasOCRBench incorporates a substantial portion of real-world sourced images for the test split to promote better generalization. Nevertheless, we agree that the absence of quantitative leakage audits (n-gram overlap, embedding cosine similarity, or explicit prompt-level separation) in the original submission weakens the robustness claims. In the revised manuscript we have added these analyses in the dataset section, reporting low overlap statistics that support the reported generalization performance of the 3B VLM. Revision: yes.
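A minimal sketch of the n-gram half of such a leakage audit, assuming transcriptions are available for both splits; the n-gram size and overlap threshold below are illustrative choices, not values reported in the paper.

```python
# Train/test leakage audit via character n-gram overlap (one of the checks
# named above). n and threshold are illustrative, not the paper's values.
def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Character n-grams of a whitespace-normalized transcription."""
    text = " ".join(text.split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def max_overlap_vs_train(test_text: str, train_ngram_sets: list[set[str]]) -> float:
    """Highest Jaccard similarity of one test transcription against any train item."""
    test_ngrams = char_ngrams(test_text)
    if not test_ngrams or not train_ngram_sets:
        return 0.0
    return max(
        len(test_ngrams & tr) / len(test_ngrams | tr) if tr else 0.0
        for tr in train_ngram_sets
    )

def leakage_report(train_texts: list[str], test_texts: list[str], threshold: float = 0.8):
    """Fraction of test items whose closest train item exceeds the overlap threshold."""
    train_sets = [char_ngrams(t) for t in train_texts]
    flagged = [t for t in test_texts if max_overlap_vs_train(t, train_sets) >= threshold]
    return len(flagged) / max(len(test_texts), 1), flagged
```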
Referee: [Evaluation] Evaluation section: The abstract and results assert SOTA performance that challenges larger models on AtlasOCRBench and KITAB-Bench, yet the manuscript supplies no concrete metrics, baseline tables, error bars, or dataset statistics. Without these, the central performance claim cannot be verified and the ablation studies remain ungrounded.
Authors: The referee is correct that the submitted manuscript did not present sufficient concrete numerical results, full baseline tables, error bars, or dataset statistics to allow independent verification of the SOTA claims. We have substantially expanded the evaluation section in the revision to include explicit performance tables (CER/WER on both benchmarks), comparisons against larger models and prior baselines, ablation tables with exact hyperparameter values, standard deviations from repeated runs, and detailed dataset statistics (sample counts, character distributions, and split sizes). These additions directly ground the performance and ablation claims. Revision: yes.
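For reference, CER and WER are conventionally computed as Levenshtein edit distance normalized by reference length. The self-contained sketch below shows that standard formulation, not the paper's own evaluation script.

```python
# Standard CER/WER definitions for OCR evaluation: edit distance over
# reference length. Illustrative reimplementation, not the paper's code.
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```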
Circularity Check
No significant circularity; empirical fine-tuning study with external benchmarks
Full rationale
The paper describes an empirical pipeline: curating a Darija dataset via OCRSmith synthetic generation plus real-world sources, fine-tuning Qwen2.5-VL 3B with QLoRA/Unsloth, running hyperparameter ablations, and reporting performance on the authors' new AtlasOCRBench plus the established external KITAB-Bench. No mathematical derivations, equations, or first-principles predictions appear in the provided text. Performance claims are direct empirical measurements rather than quantities derived from fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force results. The derivation chain is therefore self-contained and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- LoRA rank and alpha
- Learning rate and batch size (an illustrative ablation-grid sketch follows this list)
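An illustrative enumeration of how these two pairs of free parameters span an ablation grid; the candidate values are placeholders, not the grid actually swept in the paper.

```python
# Illustrative ablation grid over the listed free parameters.
# Candidate values are placeholders, not the paper's swept grid.
from itertools import product

lora_ranks = [8, 16, 32]       # LoRA rank r (alpha often tied to 2*r)
learning_rates = [1e-4, 2e-4]
batch_sizes = [4, 8]

configs = [
    {"r": r, "lora_alpha": 2 * r, "learning_rate": lr, "batch_size": bs}
    for r, lr, bs in product(lora_ranks, learning_rates, batch_sizes)
]
print(f"{len(configs)} fine-tuning runs to compare")  # 3 * 2 * 2 = 12
```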
axioms (1)
- Domain assumption: Vision-language models pre-trained on general data can be effectively adapted to specialized OCR tasks via parameter-efficient fine-tuning.
invented entities (1)
- OCRSmith library (no independent evidence)
Reference graph
Works this paper leans on
- [1] Hugging Face Team. NanoVLM: Exploring Vision-Language Models. https://huggingface.co/blog/nanovlm, 2024.
- [2] AtlasIA Team. OCRSmith: Synthetic OCR Data Generation Toolkit. https://github.com/atlasia-ma/OCRSmith, 2024.
- [3] Qwen Team. Qwen2-VL-2B-Instruct. https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct, 2024.
- [4] Qwen Team. Qwen2.5-VL-3B-Instruct. https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct, 2024.
- [5] NAMAA-Space Team. Qari-OCR-v0.3-VL-2B-Instruct. https://huggingface.co/NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct, 2024.
- [6] Mohamed Rashad. ArabicNougat: Arabic Document Understanding Model. https://huggingface.co/MohamedRashad/arabic-large-nougat, 2024.
- [7] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.
- [8] Unsloth Team. Unsloth: 5x Faster LLM Fine-tuning. https://unsloth.ai, 2024.
- [9] Damjan Kalajdzievski et al. RSLoRA: Rank-Stabilized LoRA for Fine-tuning Large Language Models. arXiv preprint, 2024.
- [10] Ahmed Attia et al. KITAB-Bench: A Comprehensive Benchmark for Arabic OCR and Document Understanding. arXiv preprint arXiv:2402.14949, 2024.
- [11] Gemma Team. Gemma: Open Models Based on Gemini Research and Technology. arXiv preprint arXiv:2403.08295, 2024.