TrOCR for Medieval HTR: A Systematic Ablation Study with Cross-Dataset Validation

Federico Simonetta; Michele Flammini; Sachin Sharma

arxiv: 2606.24302 · v1 · pith:GKVECZ5Unew · submitted 2026-06-23 · 💻 cs.CV · cs.DL

TrOCR for Medieval HTR: A Systematic Ablation Study with Cross-Dataset Validation

Sachin Sharma , Michele Flammini , Federico Simonetta This is my paper

Pith reviewed 2026-06-26 00:53 UTC · model grok-4.3

classification 💻 cs.CV cs.DL

keywords TrOCRhandwritten text recognitionmedieval manuscriptslayer freezingablation studydomain adaptationcharacter error rate

0 comments

The pith

Freezing up to three encoder layers or six decoder layers in TrOCR preserves accuracy on medieval manuscripts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how contrast normalization, data augmentation, and layer freezing influence the adaptation of a pre-trained transformer model to handwritten text in 13th-century manuscripts. Controlled experiments compare thirteen configurations on one private Italian manuscript and replicate them on the public READ-16 benchmark. Results indicate that moderate freezing of early layers maintains character error rates near the best observed value of 8.03 percent, while deeper freezing progressively increases errors. Removing contrast normalization produces 7.84 percent CER, matching a domain-specialized baseline. Decoder-layer freezing thresholds transfer more reliably across the two datasets than encoder thresholds.

Core claim

Systematic ablation shows that freezing the first three encoder layers or the first six decoder layers produces no statistically significant accuracy loss on the Cortonese manuscript, while removing CLAHE contrast normalization yields 7.84 percent CER comparable to specialized baselines; cross-dataset replication on READ-16 confirms that decoder freezing strategies transfer more robustly and that combined freezing requires dataset-specific checks.

What carries the argument

Controlled ablation grid over contrast normalization, data augmentation, and selective layer freezing in TrOCR encoder and decoder, evaluated by character error rate with statistical comparisons and attention-map diagnostics.

If this is right

Moderate encoder or decoder freezing reduces training cost without harming recognition accuracy.
Strong optimization can achieve competitive CER without contrast normalization preprocessing.
Decoder freezing thresholds generalize better across datasets than encoder thresholds.
Grad-CAM and cross-attention maps reveal error patterns tied to specific ablation choices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed freezing thresholds could reduce the compute needed when adapting similar vision transformers to other low-resource historical scripts.
If decoder layers prove more robust to freezing in additional domains, future work could prioritize decoder-side parameter-efficient tuning.
The finding that preprocessing can sometimes be dropped suggests testing whether end-to-end optimization alone suffices on even smaller or more degraded manuscript collections.

Load-bearing premise

The statistical tests across the thirteen configurations are reliable and the Cortonese plus READ-16 datasets are representative enough for the freezing and preprocessing conclusions to generalize.

What would settle it

A third independent medieval manuscript collection where freezing the first three encoder layers or six decoder layers produces a statistically significant rise in character error rate above the reported thresholds.

Figures

Figures reproduced from arXiv: 2606.24302 by Federico Simonetta, Michele Flammini, Sachin Sharma.

**Figure 1.** Figure 1: Example of a page from the I-CT 91 “Cortonese” manuscript All experiments were performed on the I-CT 91 “Cortonese” manuscript ( [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Distribution of metric values related to losses and Grad-CAM/attention maps. Violin plots and box plots are computed on the test set of the I-CT 91 dataset. Normalized entropy is defined as: H′ = − P i pi log pi log(n) where pi denotes the normalized activation at patch i, and n is the total number of patches. The normalized entropy satisfies H′ ∈ [0, 1] and measures how diffusely the attention map distrib… view at source ↗

read the original abstract

Fine-tuning transformer-based handwritten text recognition (HTR) models on medieval manuscripts is challenging because these models are pre-trained on modern text and must adapt to a very different visual domain. This paper studies how three controllable fine-tuning choices (contrast normalization, data augmentation, and layer freezing) affect recognition accuracy when adapting TrOCR to small historical datasets. We run controlled experiments on a 13th-century Italian manuscript (I-CT 91 "Cortonese") and replicate the same experimental grid on the public READ-16 benchmark as robustness evidence. On Cortonese, our best configuration achieves 8.03% character error rate (CER). Statistical comparisons across 13 configurations show that freezing up to three encoder layers or six decoder layers does not significantly harm accuracy, while deeper freezing becomes progressively detrimental. Removing contrast normalization (CLAHE) yields 7.84% CER, comparable to a domain-specialized baseline, suggesting strong optimization can reduce reliance on image preprocessing. Cross-dataset validation on READ-16 shows that decoder freezing thresholds transfer more robustly than encoder thresholds, and combined freezing strategies require dataset-specific re-validation. Finally, we use Grad-CAM gradient attributions and decoder cross-attention maps to diagnose error patterns and failure modes revealed by the ablations. Source code is available at https://github.com/LaudareProject/TrOCR-analysis

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This ablation gives practical numbers on safe layer freezing for TrOCR on medieval HTR and shows CLAHE can sometimes be skipped, but the statistical backing for the thresholds is missing key details.

read the letter

The main thing to know is that freezing up to three encoder layers or six decoder layers keeps CER stable on these datasets, while deeper freezing hurts, and dropping CLAHE still hits 7.84% CER.

They ran a 13-configuration grid on the Cortonese manuscript and repeated it on READ-16. The best setup reached 8.03% CER. Decoder freezing thresholds held up better across the two sets than encoder ones. They also added Grad-CAM and cross-attention maps to trace error patterns. Code is released.

The controlled grid and the public-benchmark replication are the solid parts. Those choices make the freezing and preprocessing findings more credible than a single-dataset run would be. The attention diagnostics give a bit more insight into what breaks when you freeze too much.

The weak spot is the statistics. The claim that certain freezing levels cause “no significant harm” rests on comparisons across the 13 setups, yet the abstract gives no numbers on independent runs, standard deviations, or the exact test. Small medieval datasets are sensitive to initialization noise, so without those details the thresholds are hard to take at face value. The cross-dataset transfer conclusion for decoder versus encoder freezing inherits the same uncertainty.

This is for people who actually fine-tune HTR models on historical manuscripts and want concrete starting points for layer choices and preprocessing. The experiments are clean enough and the questions narrow enough that it deserves referee time, though the stats section will need work.

Referee Report

2 major / 2 minor

Summary. The paper conducts a systematic ablation study on fine-tuning the TrOCR transformer model for handwritten text recognition (HTR) on medieval manuscripts. It examines three controllable choices—contrast normalization via CLAHE, data augmentation, and layer freezing—across 13 configurations on the 13th-century Cortonese manuscript (I-CT 91), with replication on the public READ-16 benchmark. The central empirical claims are that freezing up to three encoder layers or six decoder layers does not significantly harm CER (best result 8.03%), deeper freezing is progressively detrimental, removing CLAHE yields 7.84% CER comparable to specialized baselines, decoder freezing thresholds transfer more robustly across datasets than encoder ones, and Grad-CAM plus decoder attention maps diagnose error patterns. Public code is provided.

Significance. If the statistical thresholds hold, the work supplies actionable, dataset-aware guidelines for efficient adaptation of modern pre-trained HTR transformers to small historical corpora, reducing reliance on preprocessing and unnecessary fine-tuning. Strengths include the controlled experimental grid, cross-dataset replication, public code for reproducibility, and use of Grad-CAM/attention visualizations to link ablations to failure modes. These elements make the findings more falsifiable and extensible than typical single-dataset HTR ablations.

major comments (2)

[Abstract / Results (statistical comparisons)] Abstract and results section (statistical comparisons across the 13 configurations): the claim that “freezing up to three encoder layers or six decoder layers does not significantly harm accuracy” is load-bearing for the headline thresholds, yet the manuscript supplies no information on the number of independent runs per configuration, standard deviations or confidence intervals, the exact test statistic, or multiple-comparison correction. On small medieval datasets, single-run CER differences are frequently dominated by initialization noise; without these quantities the “no significant harm” thresholds cannot be distinguished from under-powered variation.
[Cross-dataset validation] Cross-dataset validation section: the assertion that “decoder freezing thresholds transfer more robustly than encoder thresholds” and that “combined freezing strategies require dataset-specific re-validation” rests on the same unreported statistical apparatus. Without per-dataset variance estimates or a direct test of the interaction between freezing depth and dataset, the differential transferability conclusion remains unverified.

minor comments (2)

[Abstract] Abstract: the phrase “13 configurations” is used without a forward reference to the table or figure that enumerates them; adding such a pointer would improve readability.
[Conclusion / reproducibility statement] The manuscript states that source code is available at the cited GitHub repository; confirming that the repository contains the exact configuration files, random seeds, and evaluation scripts used for the reported CER values would further strengthen reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater statistical transparency in our ablation study. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract / Results (statistical comparisons)] Abstract and results section (statistical comparisons across the 13 configurations): the claim that “freezing up to three encoder layers or six decoder layers does not significantly harm accuracy” is load-bearing for the headline thresholds, yet the manuscript supplies no information on the number of independent runs per configuration, standard deviations or confidence intervals, the exact test statistic, or multiple-comparison correction. On small medieval datasets, single-run CER differences are frequently dominated by initialization noise; without these quantities the “no significant harm” thresholds cannot be distinguished from under-powered variation.

Authors: We agree that the manuscript does not report the number of independent runs, standard deviations, or formal statistical tests. Each of the 13 configurations was executed as a single training run owing to computational constraints. The phrase 'does not significantly harm accuracy' was intended to describe small absolute CER differences (typically <0.5 percentage points) rather than the outcome of a hypothesis test. In revision we will (i) remove or qualify all uses of 'significantly', (ii) explicitly state that results reflect single runs, and (iii) add a limitations paragraph noting that variance estimation via multiple random seeds would strengthen future claims. revision: yes
Referee: [Cross-dataset validation] Cross-dataset validation section: the assertion that “decoder freezing thresholds transfer more robustly than encoder thresholds” and that “combined freezing strategies require dataset-specific re-validation” rests on the same unreported statistical apparatus. Without per-dataset variance estimates or a direct test of the interaction between freezing depth and dataset, the differential transferability conclusion remains unverified.

Authors: The cross-dataset observations likewise rest on single-run CER values without variance estimates or an interaction test. The claim of more robust decoder transfer is based on the pattern that the same decoder-freezing depths preserve low CER on both Cortonese and READ-16, while encoder thresholds vary more. We will revise the text to present these as descriptive patterns rather than statistically verified differences and will add the same limitations note regarding multiple runs. revision: yes

Circularity Check

0 steps flagged

Purely empirical ablation study with no derivations or self-referential predictions

full rationale

The paper reports results from controlled experiments varying contrast normalization, data augmentation, and layer freezing on two datasets, measuring CER. No equations, predictions, or derivations appear in the provided text that could reduce to quantities defined by the paper's own fitted parameters or self-citations. Statistical comparisons are presented as direct empirical outcomes rather than constructed from internal definitions. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical ML study with no new mathematical axioms, free parameters fitted to target the central claim, or invented entities.

axioms (1)

domain assumption Standard deep-learning fine-tuning assumptions hold, including that the TrOCR architecture remains suitable after domain shift to medieval handwriting.
Invoked implicitly when treating layer freezing and contrast choices as the only variables under test.

pith-pipeline@v0.9.1-grok · 5774 in / 1174 out tokens · 12304 ms · 2026-06-26T00:53:17.694818+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 15 canonical work pages

[1]

Aguilar, S.T.: TRIDIS: A comprehensive medieval and early modern cor- pus for HTR and NER (2025),https://doi.org/10.48550/arXiv.2503. 22714

work page doi:10.48550/arxiv.2503 2025
[2]

In: Barney Smith, E.H., Liwicki, M., Peng, L

Belay, B.H., Guyon, I., Mengiste, T., Tilahun, B., Liwicki, M., Tegegne, T., Egele, R.: A historical handwritten dataset for ethiopic OCR with baseline models and human-level performance. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds.) IEEE International Conference on Document Analysis and Recognition (2024)

2024
[3]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Chefer, H., Gur, S., Wolf, L.: Transformer interpretability beyond attention visualization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 782–791 (2021),https://doi. org/10.1109/CVPR46437.2021.00083

work page doi:10.1109/cvpr46437.2021.00083 2021
[4]

In: Barney Smith, E.H., Liwicki, M., Peng, L

Clérice, T., Pinche, A., Vlachou-Efstathiou, M., Chagué, A., Camps, J.B., Levenson, M.G., Brisville-Fertin, O., Boschetti, F., Fischer, F., Gervers, M., Boutreux, A., Manton, A., Gabay, S., O’Connor, P., Haverals, W., Kestemont, M., Vandyck, C., Kiessling, B.: CATMuS medieval: A mul- tilingual large-scale cross-century dataset in latin script for handwrit...

2024
[5]

IEEE Transac- tions on Pattern Analysis and Machine Intelligence 45(1), 508–524 (2020)

Coquenet,D.,Chatelain,C.,Paquet,T.:End-to-endhandwrittenparagraph text recognition using a vertical attention network. IEEE Transactions on Pattern Analysis and Machine Intelligence45(1), 508–524 (Jan 2023), ISSN 1939-3539,https://doi.org/10.1109/TPAMI.2022.3144899

work page doi:10.1109/tpami.2022.3144899 2023
[6]

In: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, pp

Fischer, A., Frinken, V., Fornés, A., Bunke, H.: Transcription alignment of latin manuscripts using hidden markov models. In: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, pp. 29–36, HIP ’11, Association for Computing Machinery (2011),https://doi.org/10. 1145/2037342.2037348

arXiv 2011
[7]

Algorithms16(3), 136 (2023), https://doi.org/10.3390/a16030136

Fischer, N., Hartelt, A., Puppe, F.: Line-level layout recognition of histori- cal documents with background knowledge. Algorithms16(3), 136 (2023), https://doi.org/10.3390/a16030136

work page doi:10.3390/a16030136 2023
[8]

Fine-tuning for handwritten text recognition

Hüttner, L., Mayr, M., Gorges, T., Wu, F., Seuret, M., Maier, A., Christlein, V.: Low-rank adaptation vs. Fine-tuning for handwritten text recognition. In:2025IEEE/CVFWinterConferenceonApplicationsofComputerVision Workshops (WACVW), pp. 1233–1242 (Feb 2025), ISSN 2690-621X,https: //doi.org/10.1109/WACVW65960.2025.00146

work page doi:10.1109/wacvw65960.2025.00146 2025
[9]

In: Yin, X.C., Karatzas, D., Lopresti, D

Kiessling, B.: Version 5 of the kraken ATR engine for the humanities. In: Yin, X.C., Karatzas, D., Lopresti, D. (eds.) Document Analysis and Recog- nition – ICDAR 2025, pp. 443–458, Springer Nature Switzerland, Cham (2026),https://doi.org/10.1007/978-3-032-04624-6_26 Title Suppressed Due to Excessive Length 15

work page doi:10.1007/978-3-032-04624-6_26 2025
[10]

In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T

Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: OCR-free document understanding transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Eu- ropean Conference on Computer Vision (2021)

2021
[11]

arXiv preprint arXiv:2111.156647(15), 2 (2021), URLhttps://arxiv.org/abs/ 2111.15664

Kim, G., Hong, T., Yim, M., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: Donut: Document understanding transformer without ocr. arXiv preprint arXiv:2111.156647(15), 2 (2021), URLhttps://arxiv.org/abs/ 2111.15664

arXiv 2021
[12]

In: Coustaty, M., Fornés, A

Kim, G., Yokoo, S., Seo, S., Osanai, A., Okamoto, Y., Baek, Y.: On text localization in end-to-end OCR-free document understanding transformer without text localization supervision. In: Coustaty, M., Fornés, A. (eds.) ICDAR Workshops (2023)

2023
[13]

Kumari, L., Singh, S., Rathore, V.V.S., Sharma, A.: GatedLexiconNet: A comprehensive end-to-end handwritten paragraph text recognition system (Apr 2024),https://doi.org/10.48550/arXiv.2404.14062

work page doi:10.48550/arxiv.2404.14062 2024
[14]

In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., Wei, F.: TrOCR: Transformer-based optical character recognition with pre-trained models. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 13094–13102 (Jun 2023),https://doi.org/10. 1609/aaai.v37i11.26538

2023
[15]

Jlis.it16(2), 59–85 (May 2025), ISSN 2038- 1026,https://doi.org/10.36253/jlis.it-641

Loss, E., Guernaccini, F., Carassai, M.: From manuscript to metadata: Ex- periments on handwritten text recognition, tagging and importation for the memoriali series (1265-1452). Jlis.it16(2), 59–85 (May 2025), ISSN 2038- 1026,https://doi.org/10.36253/jlis.it-641

work page doi:10.36253/jlis.it-641 2025
[16]

Matos, A., Almeida, P., Correia, P., Pacheco, O.: iForal: Automated hand- written text transcription for historical medieval manuscripts. J. Imag- ing11(2), 36 (Jan 2025), ISSN 2313-433X,https://doi.org/10.3390/ jimaging11020036

2025
[17]

Meoded, E.: Handwritten text recognition of historical manuscripts us- ing transformer-based models (Aug 2025),https://doi.org/10.48550/ arXiv.2508.11499

arXiv 2025
[18]

Pinche, A., Clérice, T., Chagué, A., Camps, J.B., Vlachou-Efstathiou, M., Gille Levenson, M., Brisville-Fertin, O., Boschetti, F., Fischer, F., Gervers, M., Boutreux, A., Manton, A., Gabay, S.: Catmus medieval (Jul 2024), https://doi.org/10.5281/zenodo.12743230

work page doi:10.5281/zenodo.12743230 2024
[19]

Computer Vision, Graphics, and Image Processing39,355–368(1987),URLhttps://api.semanticscholar.org/ CorpusID:62771950

Pizer, S.M., Amburn, E.P., Austin, J.D., Cromartie, R., Geselowitz, A., Greer, T., ter Haar Romeny, B.M., Zimmerman, J.B.: Adaptive histogram equalization and its variations. Computer Vision, Graphics, and Image Processing39,355–368(1987),URLhttps://api.semanticscholar.org/ CorpusID:62771950

1987
[20]

In: Proc

Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual Explanations from Deep Networks Via Gradient-based Localization. In: Proc. IEEE Int. Conf. Computer Vision (ICCV), pp. 618– 626 (Oct 2017), ISSN 2380-7504,https://doi.org/10.1109/ICCV.2017. 74 16 S. Sharma & F. Simonetta

work page doi:10.1109/iccv.2017 2017
[21]

In: IEEE International Workshop on Machine Learning for Signal Processing, arXiv, Istanbul (Jul 2025),https://doi.org/10.48550/arXiv.2507.15633

Sharma, S., Simonetta, F., Flammini, M.: Experimenting active and se- quential learning in a medieval music manuscript. In: IEEE International Workshop on Machine Learning for Signal Processing, arXiv, Istanbul (Jul 2025),https://doi.org/10.48550/arXiv.2507.15633

work page doi:10.48550/arxiv.2507.15633 2025
[22]

In: AudioMostly ’24, ACM, Milan, Italy (Aug 2024),https://doi.org/10.1145/3678299

Simonetta, F., Mondal, R., Ludovico, L.A., Ntalampiras, S.: Optical mu- sic recognition in manuscripts from the ricordi archive. In: AudioMostly ’24, ACM, Milan, Italy (Aug 2024),https://doi.org/10.1145/3678299. 3678324

work page doi:10.1145/3678299 2024
[23]

In: Coustaty, M., Fornés, A

Ströbel, P.B., Hodel, T., Boente, W., Volk, M.: The adaptability of a transformer-based OCR model for historical documents. In: Coustaty, M., Fornés, A. (eds.) ICDAR Workshops (2023)

2023
[24]

Ströbel, P.B., Clematide, S., Volk, M., Hodel, T.: Transformer-based htr for historical documents (2022), URLhttps://arxiv.org/abs/2203.11008

arXiv 2022
[25]

effec- tive prior

Toselli, A.H., Romero, V., Villegas, M., Vidal, E., Sánchez, J.A.: Htr dataset icfhr 2016 [data set]. Zenodo (2018),https://doi.org/10.5281/zenodo. 1164045

work page doi:10.5281/zenodo 2016
[26]

In: Proceed- ings of the 3rd International Workshop on Historical Document Imaging and Processing, pp

Toselli, A.H., Vidal, E.: Handwritten text recognition results on the ben- tham collection with improved classical n-gram-hmm methods. In: Proceed- ings of the 3rd International Workshop on Historical Document Imaging and Processing, pp. 15–22, HIP ’15, Association for Computing Machinery (2015),https://doi.org/10.1145/2809544.2809551

work page doi:10.1145/2809544.2809551 2015
[27]

URLhttp://dx.doi

Xu,Y.,Li,M.,Cui,L.,Huang,S.,Wei,F.,Zhou,M.:Layoutlm:Pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1192–1200, KDD ’20, ACM (Aug 2020),https://doi. org/10.1145/3394486.3403172

work page doi:10.1145/3394486.3403172 2020
[28]

Zuiderveld, K.: Contrast limited adaptive histogram equalization, pp. 474–
[29]

Academic Press Professional, Inc., USA (1994), ISBN 0123361559

1994

[1] [1]

Aguilar, S.T.: TRIDIS: A comprehensive medieval and early modern cor- pus for HTR and NER (2025),https://doi.org/10.48550/arXiv.2503. 22714

work page doi:10.48550/arxiv.2503 2025

[2] [2]

In: Barney Smith, E.H., Liwicki, M., Peng, L

Belay, B.H., Guyon, I., Mengiste, T., Tilahun, B., Liwicki, M., Tegegne, T., Egele, R.: A historical handwritten dataset for ethiopic OCR with baseline models and human-level performance. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds.) IEEE International Conference on Document Analysis and Recognition (2024)

2024

[3] [3]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Chefer, H., Gur, S., Wolf, L.: Transformer interpretability beyond attention visualization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 782–791 (2021),https://doi. org/10.1109/CVPR46437.2021.00083

work page doi:10.1109/cvpr46437.2021.00083 2021

[4] [4]

In: Barney Smith, E.H., Liwicki, M., Peng, L

Clérice, T., Pinche, A., Vlachou-Efstathiou, M., Chagué, A., Camps, J.B., Levenson, M.G., Brisville-Fertin, O., Boschetti, F., Fischer, F., Gervers, M., Boutreux, A., Manton, A., Gabay, S., O’Connor, P., Haverals, W., Kestemont, M., Vandyck, C., Kiessling, B.: CATMuS medieval: A mul- tilingual large-scale cross-century dataset in latin script for handwrit...

2024

[5] [5]

IEEE Transac- tions on Pattern Analysis and Machine Intelligence 45(1), 508–524 (2020)

Coquenet,D.,Chatelain,C.,Paquet,T.:End-to-endhandwrittenparagraph text recognition using a vertical attention network. IEEE Transactions on Pattern Analysis and Machine Intelligence45(1), 508–524 (Jan 2023), ISSN 1939-3539,https://doi.org/10.1109/TPAMI.2022.3144899

work page doi:10.1109/tpami.2022.3144899 2023

[6] [6]

In: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, pp

Fischer, A., Frinken, V., Fornés, A., Bunke, H.: Transcription alignment of latin manuscripts using hidden markov models. In: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, pp. 29–36, HIP ’11, Association for Computing Machinery (2011),https://doi.org/10. 1145/2037342.2037348

arXiv 2011

[7] [7]

Algorithms16(3), 136 (2023), https://doi.org/10.3390/a16030136

Fischer, N., Hartelt, A., Puppe, F.: Line-level layout recognition of histori- cal documents with background knowledge. Algorithms16(3), 136 (2023), https://doi.org/10.3390/a16030136

work page doi:10.3390/a16030136 2023

[8] [8]

Fine-tuning for handwritten text recognition

Hüttner, L., Mayr, M., Gorges, T., Wu, F., Seuret, M., Maier, A., Christlein, V.: Low-rank adaptation vs. Fine-tuning for handwritten text recognition. In:2025IEEE/CVFWinterConferenceonApplicationsofComputerVision Workshops (WACVW), pp. 1233–1242 (Feb 2025), ISSN 2690-621X,https: //doi.org/10.1109/WACVW65960.2025.00146

work page doi:10.1109/wacvw65960.2025.00146 2025

[9] [9]

In: Yin, X.C., Karatzas, D., Lopresti, D

Kiessling, B.: Version 5 of the kraken ATR engine for the humanities. In: Yin, X.C., Karatzas, D., Lopresti, D. (eds.) Document Analysis and Recog- nition – ICDAR 2025, pp. 443–458, Springer Nature Switzerland, Cham (2026),https://doi.org/10.1007/978-3-032-04624-6_26 Title Suppressed Due to Excessive Length 15

work page doi:10.1007/978-3-032-04624-6_26 2025

[10] [10]

In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T

Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: OCR-free document understanding transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Eu- ropean Conference on Computer Vision (2021)

2021

[11] [11]

arXiv preprint arXiv:2111.156647(15), 2 (2021), URLhttps://arxiv.org/abs/ 2111.15664

Kim, G., Hong, T., Yim, M., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: Donut: Document understanding transformer without ocr. arXiv preprint arXiv:2111.156647(15), 2 (2021), URLhttps://arxiv.org/abs/ 2111.15664

arXiv 2021

[12] [12]

In: Coustaty, M., Fornés, A

Kim, G., Yokoo, S., Seo, S., Osanai, A., Okamoto, Y., Baek, Y.: On text localization in end-to-end OCR-free document understanding transformer without text localization supervision. In: Coustaty, M., Fornés, A. (eds.) ICDAR Workshops (2023)

2023

[13] [13]

Kumari, L., Singh, S., Rathore, V.V.S., Sharma, A.: GatedLexiconNet: A comprehensive end-to-end handwritten paragraph text recognition system (Apr 2024),https://doi.org/10.48550/arXiv.2404.14062

work page doi:10.48550/arxiv.2404.14062 2024

[14] [14]

In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., Wei, F.: TrOCR: Transformer-based optical character recognition with pre-trained models. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 13094–13102 (Jun 2023),https://doi.org/10. 1609/aaai.v37i11.26538

2023

[15] [15]

Jlis.it16(2), 59–85 (May 2025), ISSN 2038- 1026,https://doi.org/10.36253/jlis.it-641

Loss, E., Guernaccini, F., Carassai, M.: From manuscript to metadata: Ex- periments on handwritten text recognition, tagging and importation for the memoriali series (1265-1452). Jlis.it16(2), 59–85 (May 2025), ISSN 2038- 1026,https://doi.org/10.36253/jlis.it-641

work page doi:10.36253/jlis.it-641 2025

[16] [16]

Matos, A., Almeida, P., Correia, P., Pacheco, O.: iForal: Automated hand- written text transcription for historical medieval manuscripts. J. Imag- ing11(2), 36 (Jan 2025), ISSN 2313-433X,https://doi.org/10.3390/ jimaging11020036

2025

[17] [17]

Meoded, E.: Handwritten text recognition of historical manuscripts us- ing transformer-based models (Aug 2025),https://doi.org/10.48550/ arXiv.2508.11499

arXiv 2025

[18] [18]

Pinche, A., Clérice, T., Chagué, A., Camps, J.B., Vlachou-Efstathiou, M., Gille Levenson, M., Brisville-Fertin, O., Boschetti, F., Fischer, F., Gervers, M., Boutreux, A., Manton, A., Gabay, S.: Catmus medieval (Jul 2024), https://doi.org/10.5281/zenodo.12743230

work page doi:10.5281/zenodo.12743230 2024

[19] [19]

Computer Vision, Graphics, and Image Processing39,355–368(1987),URLhttps://api.semanticscholar.org/ CorpusID:62771950

Pizer, S.M., Amburn, E.P., Austin, J.D., Cromartie, R., Geselowitz, A., Greer, T., ter Haar Romeny, B.M., Zimmerman, J.B.: Adaptive histogram equalization and its variations. Computer Vision, Graphics, and Image Processing39,355–368(1987),URLhttps://api.semanticscholar.org/ CorpusID:62771950

1987

[20] [20]

In: Proc

Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual Explanations from Deep Networks Via Gradient-based Localization. In: Proc. IEEE Int. Conf. Computer Vision (ICCV), pp. 618– 626 (Oct 2017), ISSN 2380-7504,https://doi.org/10.1109/ICCV.2017. 74 16 S. Sharma & F. Simonetta

work page doi:10.1109/iccv.2017 2017

[21] [21]

In: IEEE International Workshop on Machine Learning for Signal Processing, arXiv, Istanbul (Jul 2025),https://doi.org/10.48550/arXiv.2507.15633

Sharma, S., Simonetta, F., Flammini, M.: Experimenting active and se- quential learning in a medieval music manuscript. In: IEEE International Workshop on Machine Learning for Signal Processing, arXiv, Istanbul (Jul 2025),https://doi.org/10.48550/arXiv.2507.15633

work page doi:10.48550/arxiv.2507.15633 2025

[22] [22]

In: AudioMostly ’24, ACM, Milan, Italy (Aug 2024),https://doi.org/10.1145/3678299

Simonetta, F., Mondal, R., Ludovico, L.A., Ntalampiras, S.: Optical mu- sic recognition in manuscripts from the ricordi archive. In: AudioMostly ’24, ACM, Milan, Italy (Aug 2024),https://doi.org/10.1145/3678299. 3678324

work page doi:10.1145/3678299 2024

[23] [23]

In: Coustaty, M., Fornés, A

Ströbel, P.B., Hodel, T., Boente, W., Volk, M.: The adaptability of a transformer-based OCR model for historical documents. In: Coustaty, M., Fornés, A. (eds.) ICDAR Workshops (2023)

2023

[24] [24]

Ströbel, P.B., Clematide, S., Volk, M., Hodel, T.: Transformer-based htr for historical documents (2022), URLhttps://arxiv.org/abs/2203.11008

arXiv 2022

[25] [25]

effec- tive prior

Toselli, A.H., Romero, V., Villegas, M., Vidal, E., Sánchez, J.A.: Htr dataset icfhr 2016 [data set]. Zenodo (2018),https://doi.org/10.5281/zenodo. 1164045

work page doi:10.5281/zenodo 2016

[26] [26]

In: Proceed- ings of the 3rd International Workshop on Historical Document Imaging and Processing, pp

Toselli, A.H., Vidal, E.: Handwritten text recognition results on the ben- tham collection with improved classical n-gram-hmm methods. In: Proceed- ings of the 3rd International Workshop on Historical Document Imaging and Processing, pp. 15–22, HIP ’15, Association for Computing Machinery (2015),https://doi.org/10.1145/2809544.2809551

work page doi:10.1145/2809544.2809551 2015

[27] [27]

URLhttp://dx.doi

Xu,Y.,Li,M.,Cui,L.,Huang,S.,Wei,F.,Zhou,M.:Layoutlm:Pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1192–1200, KDD ’20, ACM (Aug 2020),https://doi. org/10.1145/3394486.3403172

work page doi:10.1145/3394486.3403172 2020

[28] [28]

Zuiderveld, K.: Contrast limited adaptive histogram equalization, pp. 474–

[29] [29]

Academic Press Professional, Inc., USA (1994), ISBN 0123361559

1994