Pith · machine review for the scientific record

arxiv: 2604.08070 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Recognition: unknown

AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords Darija OCR · Moroccan Arabic · Vision-language model · Optical character recognition · Low-resource language · Synthetic data · Fine-tuning · Arabic script

The pith

A 3-billion-parameter vision-language model fine-tuned on mixed synthetic and real Darija data becomes the first open-source OCR system for the Moroccan Arabic dialect.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AtlasOCR as the first open-source tool capable of recognizing printed text in Darija. It explains how a dataset was assembled from synthetic generations and real sources, then used to adapt a 3-billion-parameter model through efficient training methods. Tests on a new benchmark and an established one show the resulting system matches or exceeds the accuracy of much larger models on both Darija and standard Arabic text. A reader would care because Darija appears widely in images and documents yet has had no dedicated recognition software, so practical digitization and search become possible for the first time.

Core claim

AtlasOCR is the first open-source Darija OCR model, obtained by fine-tuning the Qwen2.5-VL 3B vision-language model on a dataset that combines synthetic images generated by the OCRSmith library with carefully selected real-world examples. After parameter-efficient training, the model reaches state-of-the-art results on the new AtlasOCRBench and on KITAB-Bench while remaining robust across both dialectal and standard Arabic scripts.

What carries the argument

Fine-tuning the Qwen2.5-VL 3B vision-language model, which jointly processes image and text inputs, using QLoRA for low-rank adaptation and Unsloth for faster training on a Darija-specific OCR dataset.
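The low-rank update that QLoRA builds on can be sketched in a few lines: instead of training the full weight matrix W, only two small matrices A and B are trained, and the effective weight is W + (alpha/r)·B·A. The toy shapes and values below are illustrative, not the paper's actual configuration.

```python
# Minimal sketch of the low-rank update at the heart of (Q)LoRA:
# train small A (r x d_in) and B (d_out x r) and use
#   W_eff = W + (alpha / r) * B @ A
# Shapes and numbers here are illustrative, not the paper's.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha):
    r = len(A)            # rank = number of rows of A
    scale = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example: d_out = d_in = 2, rank r = 1, alpha = 2.
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 2.0]]         # 1 x 2
B = [[0.5], [0.25]]      # 2 x 1
W_eff = lora_effective_weight(W, A, B, alpha=2.0)
print(W_eff)  # [[2.0, 2.0], [0.5, 2.0]]
```

Because only A and B receive gradients (and, in QLoRA, the frozen W is kept 4-bit quantized), the trainable parameter count drops by orders of magnitude, which is what makes fine-tuning a 3B VLM feasible on modest hardware.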

If this is right

  • Darija documents and images can now be converted to searchable text without relying on proprietary or foreign-language OCR engines.
  • A single small model suffices for both the Moroccan dialect and standard Arabic, reducing the need to maintain separate systems.
  • The AtlasOCRBench dataset supplies a public standard against which future Darija OCR work can be measured.
  • Synthetic data generation can supplement scarce real examples for other low-resource language variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same data-curation and fine-tuning recipe could be applied to other Arabic dialects that currently lack OCR support.
  • Deployment on modest hardware becomes feasible once the model size stays at 3 billion parameters.
  • Integration with mobile cameras could enable real-time translation of Darija signs and menus.

Load-bearing premise

The performance gains come from genuine learning of Darija text patterns rather than from the model merely memorizing the particular distribution of images and fonts in the training and benchmark collections.

What would settle it

A drop in accuracy below the reported state-of-the-art when the model is tested on freshly collected Darija images from sources never used in training or in AtlasOCRBench or KITAB-Bench, such as new street signs or handwritten notes.

Figures

Figures reproduced from arXiv: 2604.08070 by Abdeljalil Elmajjodi, Haitame Bouanane, Imane Momayiz, Soufiane Ait Elaouad.

Figure 2: Synthetic Darija Text Examples Generated.
Figure 1: Vision Language Model Architecture [1]. For OCR applications, this architecture enables understanding both visual text layout and linguistic nuances, crucial for accurately recognizing Darija text across diverse fonts, styles, and backgrounds.
Figure 3: Examples from each real-world data source: (a) Scanned Literature, (b) Social Media Content, (c) Educational Materials, (d) Recipe Collection.
Figure 4: Benchmark Creation Pipeline. (1) Pseudo-labeling: Gemini 2.0 Flash with carefully engineered prompts prioritizing human readability. (2) Human Annotation: manual review and correction using Argilla for collaborative editing. The final AtlasOCRBench contains 251 samples, including 55 from scanned literature.
Figure 5: AtlasOCRBench Results (lower CER indicates better performance).
Figure 6: KITAB-Bench Results (lower CER indicates better performance).
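Both benchmarks score models by character error rate (CER): the Levenshtein edit distance between the predicted and reference transcriptions, divided by the reference length. A minimal sketch, with illustrative strings rather than actual Darija benchmark samples:

```python
# CER as reported in Figures 5 and 6: edit distance between
# prediction and reference, divided by reference length.
# Example strings are illustrative, not benchmark data.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    return edit_distance(prediction, reference) / len(reference)

print(cer("kitab", "kitob"))  # one substitution over 5 chars -> 0.2
```

Lower is better: a CER of 0.0 means a perfect transcription, and values above 1.0 are possible when the prediction is much longer than the reference.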
Original abstract

Darija, the Moroccan Arabic dialect, is rich in visual content yet lacks specialized Optical Character Recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model built by fine-tuning a 3B parameter Vision Language Model (VLM). We detail our comprehensive approach, from curating a unique Darija-specific dataset leveraging both synthetic generation with our OCRSmith library and carefully sourced real-world data, to implementing efficient fine-tuning strategies. We utilize QLoRA and Unsloth for parameter-efficient training of Qwen2.5-VL 3B and present comprehensive ablation studies optimizing key hyperparameters. Our evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench demonstrates state-of-the-art performance, challenging larger models and highlighting AtlasOCR's robustness and generalization capabilities for both Darija and standard Arabic OCR tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AtlasOCR as the first open-source OCR model for Darija (Moroccan Arabic dialect) by fine-tuning the 3B-parameter Qwen2.5-VL vision-language model with QLoRA and Unsloth. It describes curating a mixed dataset of synthetic images generated via the authors' OCRSmith library plus real-world sources, presents ablation studies on hyperparameters such as LoRA rank and learning rate, and evaluates on the new AtlasOCRBench plus KITAB-Bench, claiming state-of-the-art performance and robustness for both Darija and standard Arabic OCR tasks.

Significance. If the SOTA claims hold after proper validation, the work would fill a clear gap by delivering the first open-source Darija OCR system, enabling practical applications in document processing, accessibility, and cultural preservation for a low-resource dialect. The efficient fine-tuning recipe on a compact VLM and the release of AtlasOCRBench could also serve as a reusable template for other dialect-specific visual-text tasks.

major comments (2)
  1. [Dataset construction] Dataset construction section: AtlasOCRBench is built using the identical OCRSmith synthetic pipeline employed for training data (plus real-world sources). No quantitative leakage audit—such as n-gram overlap statistics, embedding cosine similarity, or prompt-level separation between train and test splits—is reported. This directly weakens the generalization and robustness claims for a 3B VLM fine-tuned on dialect-specific visual text.
  2. [Evaluation] Evaluation section: The abstract and results assert SOTA performance that challenges larger models on AtlasOCRBench and KITAB-Bench, yet the manuscript supplies no concrete metrics, baseline tables, error bars, or dataset statistics. Without these, the central performance claim cannot be verified and the ablation studies remain ungrounded.
minor comments (2)
  1. [Abstract] Abstract: The claim of 'comprehensive ablation studies' is stated without even a one-sentence summary of the key hyperparameter findings or optimal configuration, reducing immediate readability.
  2. [Introduction] Notation: The acronyms VLM, QLoRA, and Unsloth are used without expansion at first appearance, which may hinder readers outside the immediate subfield.
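The leakage audit requested in major comment 1 can be as simple as measuring what fraction of test-set character n-grams already occur in the training corpus. A hedged sketch — the corpora below are hypothetical stand-ins, not the paper's data:

```python
# Sketch of an n-gram leakage audit between train and test splits:
# the fraction of test-set character n-grams also present in the
# training corpus. A high ratio would flag likely contamination.
# Corpora here are hypothetical placeholders.

def char_ngrams(text: str, n: int = 5) -> set:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_ratio(train_texts, test_texts, n: int = 5) -> float:
    train_grams = set()
    for t in train_texts:
        train_grams |= char_ngrams(t, n)
    test_grams = set()
    for t in test_texts:
        test_grams |= char_ngrams(t, n)
    if not test_grams:
        return 0.0
    return len(test_grams & train_grams) / len(test_grams)

train = ["synthetic darija line one", "synthetic darija line two"]
test = ["a freshly sourced darija line"]
ratio = overlap_ratio(train, test)
assert 0.0 <= ratio <= 1.0
```

Embedding cosine similarity between train and test images would catch visual near-duplicates that a text-only n-gram check misses; the two audits are complementary.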

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: AtlasOCRBench is built using the identical OCRSmith synthetic pipeline employed for training data (plus real-world sources). No quantitative leakage audit—such as n-gram overlap statistics, embedding cosine similarity, or prompt-level separation between train and test splits—is reported. This directly weakens the generalization and robustness claims for a 3B VLM fine-tuned on dialect-specific visual text.

    Authors: We thank the referee for raising this valid concern about data leakage. The training data primarily consists of synthetic images generated via OCRSmith, while AtlasOCRBench incorporates a substantial portion of real-world sourced images for the test split to promote better generalization. Nevertheless, we agree that the absence of quantitative leakage audits (n-gram overlap, embedding cosine similarity, or explicit prompt-level separation) in the original submission weakens the robustness claims. In the revised manuscript we have added these analyses in the dataset section, reporting low overlap statistics that support the reported generalization performance of the 3B VLM. revision: yes

  2. Referee: [Evaluation] Evaluation section: The abstract and results assert SOTA performance that challenges larger models on AtlasOCRBench and KITAB-Bench, yet the manuscript supplies no concrete metrics, baseline tables, error bars, or dataset statistics. Without these, the central performance claim cannot be verified and the ablation studies remain ungrounded.

    Authors: The referee is correct that the submitted manuscript did not present sufficient concrete numerical results, full baseline tables, error bars, or dataset statistics to allow independent verification of the SOTA claims. We have substantially expanded the evaluation section in the revision to include explicit performance tables (CER/WER on both benchmarks), comparisons against larger models and prior baselines, ablation tables with exact hyperparameter values, standard deviations from repeated runs, and detailed dataset statistics (sample counts, character distributions, and split sizes). These additions directly ground the performance and ablation claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical fine-tuning study with external benchmarks

full rationale

The paper describes an empirical pipeline: curating a Darija dataset via OCRSmith synthetic generation plus real-world sources, fine-tuning Qwen2.5-VL 3B with QLoRA/Unsloth, running hyperparameter ablations, and reporting performance on the authors' new AtlasOCRBench plus the established external KITAB-Bench. No mathematical derivations, equations, or first-principles predictions appear in the provided text. Performance claims are direct empirical measurements rather than quantities derived from fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force results. The derivation chain is therefore self-contained and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The work rests on standard assumptions about VLM transfer learning for OCR and introduces new artifacts (dataset, library, benchmarks) whose quality is asserted but not independently verified here.

free parameters (2)
  • LoRA rank and alpha
    Hyperparameters selected via ablation studies to balance efficiency and performance during fine-tuning of the 3B model.
  • Learning rate and batch size
    Tuned during training with Unsloth and QLoRA; specific values not stated in abstract but critical to reported results.
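An ablation over these free parameters amounts to enumerating a configuration grid and training one run per cell. A minimal sketch — the candidate values below are illustrative placeholders, not the sweep the paper actually ran:

```python
# Hypothetical ablation grid over the ledger's free parameters
# (LoRA rank/alpha, learning rate, batch size). The candidate
# values are illustrative, not the paper's actual sweep.
from itertools import product

grid = {
    "lora_rank": [8, 16, 32],
    "lora_alpha": [16, 32],
    "learning_rate": [1e-4, 2e-4],
    "batch_size": [4, 8],
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 3 * 2 * 2 * 2 = 24 runs
```

Even this small grid already demands 24 training runs, which is why parameter-efficient methods like QLoRA matter: they make each cell of the sweep cheap enough to actually execute.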
axioms (1)
  • domain assumption Vision-language models pre-trained on general data can be effectively adapted to specialized OCR tasks via parameter-efficient fine-tuning.
    Invoked implicitly when claiming that fine-tuning Qwen2.5-VL 3B yields robust Darija OCR.
invented entities (1)
  • OCRSmith library no independent evidence
    purpose: Synthetic generation of Darija text images for dataset creation.
    New tool introduced to supplement real-world data; no external validation of its output quality provided in abstract.

pith-pipeline@v0.9.0 · 5464 in / 1546 out tokens · 63006 ms · 2026-05-10T17:36:20.522409+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

11 extracted references · 3 canonical work pages · 2 internal anchors

  [1] Hugging Face Team. NanoVLM: Exploring Vision-Language Models. https://huggingface.co/blog/nanovlm, 2024.
  [2] AtlasIA Team. OCRSmith: Synthetic OCR Data Generation Toolkit. https://github.com/atlasia-ma/OCRSmith, 2024.
  [3] Qwen Team. Qwen2-VL-2B-Instruct. https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct, 2024.
  [4] Qwen Team. Qwen2.5-VL-3B-Instruct. https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct, 2024.
  [5] NAMAA-Space Team. Qari-OCR-v0.3-VL-2B-Instruct. https://huggingface.co/NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct, 2024.
  [6] Mohamed Rashad. ArabicNougat: Arabic Document Understanding Model. https://huggingface.co/MohamedRashad/arabic-large-nougat, 2024.
  [7] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.
  [8] Unsloth Team. Unsloth: 5x Faster LLM Fine-tuning. https://unsloth.ai, 2024.
  [9] Damjan Kalajdzievski et al. RSLoRA: Rank-Stabilized LoRA for Fine-tuning Large Language Models. arXiv preprint, 2024.
  [10] Ahmed Attia et al. KITAB-Bench: A Comprehensive Benchmark for Arabic OCR and Document Understanding. arXiv preprint arXiv:2402.14949, 2024.
  [11] Gemma Team. Gemma: Open Models Based on Gemini Research and Technology. arXiv preprint arXiv:2403.08295, 2024.