TrOCR for Medieval HTR: A Systematic Ablation Study with Cross-Dataset Validation
Pith reviewed 2026-06-26 00:53 UTC · model grok-4.3
The pith
Freezing up to three encoder layers or six decoder layers in TrOCR preserves accuracy on medieval manuscripts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Systematic ablation shows that freezing the first three encoder layers or the first six decoder layers produces no statistically significant accuracy loss on the Cortonese manuscript, while removing CLAHE contrast normalization yields 7.84 percent CER comparable to specialized baselines; cross-dataset replication on READ-16 confirms that decoder freezing strategies transfer more robustly and that combined freezing requires dataset-specific checks.
What carries the argument
Controlled ablation grid over contrast normalization, data augmentation, and selective layer freezing in TrOCR encoder and decoder, evaluated by character error rate with statistical comparisons and attention-map diagnostics.
If this is right
- Moderate encoder or decoder freezing reduces training cost without harming recognition accuracy.
- Strong optimization can achieve competitive CER without contrast normalization preprocessing.
- Decoder freezing thresholds generalize better across datasets than encoder thresholds.
- Grad-CAM and cross-attention maps reveal error patterns tied to specific ablation choices.
Where Pith is reading between the lines
- The observed freezing thresholds could reduce the compute needed when adapting similar vision transformers to other low-resource historical scripts.
- If decoder layers prove more robust to freezing in additional domains, future work could prioritize decoder-side parameter-efficient tuning.
- The finding that preprocessing can sometimes be dropped suggests testing whether end-to-end optimization alone suffices on even smaller or more degraded manuscript collections.
Load-bearing premise
The statistical tests across the thirteen configurations are reliable and the Cortonese plus READ-16 datasets are representative enough for the freezing and preprocessing conclusions to generalize.
What would settle it
A third independent medieval manuscript collection where freezing the first three encoder layers or six decoder layers produces a statistically significant rise in character error rate above the reported thresholds.
Figures
read the original abstract
Fine-tuning transformer-based handwritten text recognition (HTR) models on medieval manuscripts is challenging because these models are pre-trained on modern text and must adapt to a very different visual domain. This paper studies how three controllable fine-tuning choices (contrast normalization, data augmentation, and layer freezing) affect recognition accuracy when adapting TrOCR to small historical datasets. We run controlled experiments on a 13th-century Italian manuscript (I-CT 91 "Cortonese") and replicate the same experimental grid on the public READ-16 benchmark as robustness evidence. On Cortonese, our best configuration achieves 8.03% character error rate (CER). Statistical comparisons across 13 configurations show that freezing up to three encoder layers or six decoder layers does not significantly harm accuracy, while deeper freezing becomes progressively detrimental. Removing contrast normalization (CLAHE) yields 7.84% CER, comparable to a domain-specialized baseline, suggesting strong optimization can reduce reliance on image preprocessing. Cross-dataset validation on READ-16 shows that decoder freezing thresholds transfer more robustly than encoder thresholds, and combined freezing strategies require dataset-specific re-validation. Finally, we use Grad-CAM gradient attributions and decoder cross-attention maps to diagnose error patterns and failure modes revealed by the ablations. Source code is available at https://github.com/LaudareProject/TrOCR-analysis
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a systematic ablation study on fine-tuning the TrOCR transformer model for handwritten text recognition (HTR) on medieval manuscripts. It examines three controllable choices—contrast normalization via CLAHE, data augmentation, and layer freezing—across 13 configurations on the 13th-century Cortonese manuscript (I-CT 91), with replication on the public READ-16 benchmark. The central empirical claims are that freezing up to three encoder layers or six decoder layers does not significantly harm CER (best result 8.03%), deeper freezing is progressively detrimental, removing CLAHE yields 7.84% CER comparable to specialized baselines, decoder freezing thresholds transfer more robustly across datasets than encoder ones, and Grad-CAM plus decoder attention maps diagnose error patterns. Public code is provided.
Significance. If the statistical thresholds hold, the work supplies actionable, dataset-aware guidelines for efficient adaptation of modern pre-trained HTR transformers to small historical corpora, reducing reliance on preprocessing and unnecessary fine-tuning. Strengths include the controlled experimental grid, cross-dataset replication, public code for reproducibility, and use of Grad-CAM/attention visualizations to link ablations to failure modes. These elements make the findings more falsifiable and extensible than typical single-dataset HTR ablations.
major comments (2)
- [Abstract / Results (statistical comparisons)] Abstract and results section (statistical comparisons across the 13 configurations): the claim that “freezing up to three encoder layers or six decoder layers does not significantly harm accuracy” is load-bearing for the headline thresholds, yet the manuscript supplies no information on the number of independent runs per configuration, standard deviations or confidence intervals, the exact test statistic, or multiple-comparison correction. On small medieval datasets, single-run CER differences are frequently dominated by initialization noise; without these quantities the “no significant harm” thresholds cannot be distinguished from under-powered variation.
- [Cross-dataset validation] Cross-dataset validation section: the assertion that “decoder freezing thresholds transfer more robustly than encoder thresholds” and that “combined freezing strategies require dataset-specific re-validation” rests on the same unreported statistical apparatus. Without per-dataset variance estimates or a direct test of the interaction between freezing depth and dataset, the differential transferability conclusion remains unverified.
minor comments (2)
- [Abstract] Abstract: the phrase “13 configurations” is used without a forward reference to the table or figure that enumerates them; adding such a pointer would improve readability.
- [Conclusion / reproducibility statement] The manuscript states that source code is available at the cited GitHub repository; confirming that the repository contains the exact configuration files, random seeds, and evaluation scripts used for the reported CER values would further strengthen reproducibility claims.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for greater statistical transparency in our ablation study. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract / Results (statistical comparisons)] Abstract and results section (statistical comparisons across the 13 configurations): the claim that “freezing up to three encoder layers or six decoder layers does not significantly harm accuracy” is load-bearing for the headline thresholds, yet the manuscript supplies no information on the number of independent runs per configuration, standard deviations or confidence intervals, the exact test statistic, or multiple-comparison correction. On small medieval datasets, single-run CER differences are frequently dominated by initialization noise; without these quantities the “no significant harm” thresholds cannot be distinguished from under-powered variation.
Authors: We agree that the manuscript does not report the number of independent runs, standard deviations, or formal statistical tests. Each of the 13 configurations was executed as a single training run owing to computational constraints. The phrase 'does not significantly harm accuracy' was intended to describe small absolute CER differences (typically <0.5 percentage points) rather than the outcome of a hypothesis test. In revision we will (i) remove or qualify all uses of 'significantly', (ii) explicitly state that results reflect single runs, and (iii) add a limitations paragraph noting that variance estimation via multiple random seeds would strengthen future claims. revision: yes
-
Referee: [Cross-dataset validation] Cross-dataset validation section: the assertion that “decoder freezing thresholds transfer more robustly than encoder thresholds” and that “combined freezing strategies require dataset-specific re-validation” rests on the same unreported statistical apparatus. Without per-dataset variance estimates or a direct test of the interaction between freezing depth and dataset, the differential transferability conclusion remains unverified.
Authors: The cross-dataset observations likewise rest on single-run CER values without variance estimates or an interaction test. The claim of more robust decoder transfer is based on the pattern that the same decoder-freezing depths preserve low CER on both Cortonese and READ-16, while encoder thresholds vary more. We will revise the text to present these as descriptive patterns rather than statistically verified differences and will add the same limitations note regarding multiple runs. revision: yes
Circularity Check
Purely empirical ablation study with no derivations or self-referential predictions
full rationale
The paper reports results from controlled experiments varying contrast normalization, data augmentation, and layer freezing on two datasets, measuring CER. No equations, predictions, or derivations appear in the provided text that could reduce to quantities defined by the paper's own fitted parameters or self-citations. Statistical comparisons are presented as direct empirical outcomes rather than constructed from internal definitions. This is a standard non-circular empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard deep-learning fine-tuning assumptions hold, including that the TrOCR architecture remains suitable after domain shift to medieval handwriting.
Reference graph
Works this paper leans on
-
[1]
Aguilar, S.T.: TRIDIS: A comprehensive medieval and early modern cor- pus for HTR and NER (2025),https://doi.org/10.48550/arXiv.2503. 22714
-
[2]
In: Barney Smith, E.H., Liwicki, M., Peng, L
Belay, B.H., Guyon, I., Mengiste, T., Tilahun, B., Liwicki, M., Tegegne, T., Egele, R.: A historical handwritten dataset for ethiopic OCR with baseline models and human-level performance. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds.) IEEE International Conference on Document Analysis and Recognition (2024)
2024
-
[3]
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Chefer, H., Gur, S., Wolf, L.: Transformer interpretability beyond attention visualization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 782–791 (2021),https://doi. org/10.1109/CVPR46437.2021.00083
-
[4]
In: Barney Smith, E.H., Liwicki, M., Peng, L
Clérice, T., Pinche, A., Vlachou-Efstathiou, M., Chagué, A., Camps, J.B., Levenson, M.G., Brisville-Fertin, O., Boschetti, F., Fischer, F., Gervers, M., Boutreux, A., Manton, A., Gabay, S., O’Connor, P., Haverals, W., Kestemont, M., Vandyck, C., Kiessling, B.: CATMuS medieval: A mul- tilingual large-scale cross-century dataset in latin script for handwrit...
2024
-
[5]
IEEE Transac- tions on Pattern Analysis and Machine Intelligence 45(1), 508–524 (2020)
Coquenet,D.,Chatelain,C.,Paquet,T.:End-to-endhandwrittenparagraph text recognition using a vertical attention network. IEEE Transactions on Pattern Analysis and Machine Intelligence45(1), 508–524 (Jan 2023), ISSN 1939-3539,https://doi.org/10.1109/TPAMI.2022.3144899
-
[6]
In: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, pp
Fischer, A., Frinken, V., Fornés, A., Bunke, H.: Transcription alignment of latin manuscripts using hidden markov models. In: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, pp. 29–36, HIP ’11, Association for Computing Machinery (2011),https://doi.org/10. 1145/2037342.2037348
arXiv 2011
-
[7]
Algorithms16(3), 136 (2023), https://doi.org/10.3390/a16030136
Fischer, N., Hartelt, A., Puppe, F.: Line-level layout recognition of histori- cal documents with background knowledge. Algorithms16(3), 136 (2023), https://doi.org/10.3390/a16030136
-
[8]
Fine-tuning for handwritten text recognition
Hüttner, L., Mayr, M., Gorges, T., Wu, F., Seuret, M., Maier, A., Christlein, V.: Low-rank adaptation vs. Fine-tuning for handwritten text recognition. In:2025IEEE/CVFWinterConferenceonApplicationsofComputerVision Workshops (WACVW), pp. 1233–1242 (Feb 2025), ISSN 2690-621X,https: //doi.org/10.1109/WACVW65960.2025.00146
-
[9]
In: Yin, X.C., Karatzas, D., Lopresti, D
Kiessling, B.: Version 5 of the kraken ATR engine for the humanities. In: Yin, X.C., Karatzas, D., Lopresti, D. (eds.) Document Analysis and Recog- nition – ICDAR 2025, pp. 443–458, Springer Nature Switzerland, Cham (2026),https://doi.org/10.1007/978-3-032-04624-6_26 Title Suppressed Due to Excessive Length 15
-
[10]
In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T
Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: OCR-free document understanding transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Eu- ropean Conference on Computer Vision (2021)
2021
-
[11]
arXiv preprint arXiv:2111.156647(15), 2 (2021), URLhttps://arxiv.org/abs/ 2111.15664
Kim, G., Hong, T., Yim, M., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: Donut: Document understanding transformer without ocr. arXiv preprint arXiv:2111.156647(15), 2 (2021), URLhttps://arxiv.org/abs/ 2111.15664
arXiv 2021
-
[12]
In: Coustaty, M., Fornés, A
Kim, G., Yokoo, S., Seo, S., Osanai, A., Okamoto, Y., Baek, Y.: On text localization in end-to-end OCR-free document understanding transformer without text localization supervision. In: Coustaty, M., Fornés, A. (eds.) ICDAR Workshops (2023)
2023
-
[13]
Kumari, L., Singh, S., Rathore, V.V.S., Sharma, A.: GatedLexiconNet: A comprehensive end-to-end handwritten paragraph text recognition system (Apr 2024),https://doi.org/10.48550/arXiv.2404.14062
-
[14]
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol
Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., Wei, F.: TrOCR: Transformer-based optical character recognition with pre-trained models. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 13094–13102 (Jun 2023),https://doi.org/10. 1609/aaai.v37i11.26538
2023
-
[15]
Jlis.it16(2), 59–85 (May 2025), ISSN 2038- 1026,https://doi.org/10.36253/jlis.it-641
Loss, E., Guernaccini, F., Carassai, M.: From manuscript to metadata: Ex- periments on handwritten text recognition, tagging and importation for the memoriali series (1265-1452). Jlis.it16(2), 59–85 (May 2025), ISSN 2038- 1026,https://doi.org/10.36253/jlis.it-641
-
[16]
Matos, A., Almeida, P., Correia, P., Pacheco, O.: iForal: Automated hand- written text transcription for historical medieval manuscripts. J. Imag- ing11(2), 36 (Jan 2025), ISSN 2313-433X,https://doi.org/10.3390/ jimaging11020036
2025
-
[17]
Meoded, E.: Handwritten text recognition of historical manuscripts us- ing transformer-based models (Aug 2025),https://doi.org/10.48550/ arXiv.2508.11499
arXiv 2025
-
[18]
Pinche, A., Clérice, T., Chagué, A., Camps, J.B., Vlachou-Efstathiou, M., Gille Levenson, M., Brisville-Fertin, O., Boschetti, F., Fischer, F., Gervers, M., Boutreux, A., Manton, A., Gabay, S.: Catmus medieval (Jul 2024), https://doi.org/10.5281/zenodo.12743230
-
[19]
Computer Vision, Graphics, and Image Processing39,355–368(1987),URLhttps://api.semanticscholar.org/ CorpusID:62771950
Pizer, S.M., Amburn, E.P., Austin, J.D., Cromartie, R., Geselowitz, A., Greer, T., ter Haar Romeny, B.M., Zimmerman, J.B.: Adaptive histogram equalization and its variations. Computer Vision, Graphics, and Image Processing39,355–368(1987),URLhttps://api.semanticscholar.org/ CorpusID:62771950
1987
-
[20]
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual Explanations from Deep Networks Via Gradient-based Localization. In: Proc. IEEE Int. Conf. Computer Vision (ICCV), pp. 618– 626 (Oct 2017), ISSN 2380-7504,https://doi.org/10.1109/ICCV.2017. 74 16 S. Sharma & F. Simonetta
-
[21]
Sharma, S., Simonetta, F., Flammini, M.: Experimenting active and se- quential learning in a medieval music manuscript. In: IEEE International Workshop on Machine Learning for Signal Processing, arXiv, Istanbul (Jul 2025),https://doi.org/10.48550/arXiv.2507.15633
-
[22]
In: AudioMostly ’24, ACM, Milan, Italy (Aug 2024),https://doi.org/10.1145/3678299
Simonetta, F., Mondal, R., Ludovico, L.A., Ntalampiras, S.: Optical mu- sic recognition in manuscripts from the ricordi archive. In: AudioMostly ’24, ACM, Milan, Italy (Aug 2024),https://doi.org/10.1145/3678299. 3678324
-
[23]
In: Coustaty, M., Fornés, A
Ströbel, P.B., Hodel, T., Boente, W., Volk, M.: The adaptability of a transformer-based OCR model for historical documents. In: Coustaty, M., Fornés, A. (eds.) ICDAR Workshops (2023)
2023
-
[24]
Ströbel, P.B., Clematide, S., Volk, M., Hodel, T.: Transformer-based htr for historical documents (2022), URLhttps://arxiv.org/abs/2203.11008
arXiv 2022
-
[25]
Toselli, A.H., Romero, V., Villegas, M., Vidal, E., Sánchez, J.A.: Htr dataset icfhr 2016 [data set]. Zenodo (2018),https://doi.org/10.5281/zenodo. 1164045
-
[26]
Toselli, A.H., Vidal, E.: Handwritten text recognition results on the ben- tham collection with improved classical n-gram-hmm methods. In: Proceed- ings of the 3rd International Workshop on Historical Document Imaging and Processing, pp. 15–22, HIP ’15, Association for Computing Machinery (2015),https://doi.org/10.1145/2809544.2809551
-
[27]
Xu,Y.,Li,M.,Cui,L.,Huang,S.,Wei,F.,Zhou,M.:Layoutlm:Pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1192–1200, KDD ’20, ACM (Aug 2020),https://doi. org/10.1145/3394486.3403172
-
[28]
Zuiderveld, K.: Contrast limited adaptive histogram equalization, pp. 474–
-
[29]
Academic Press Professional, Inc., USA (1994), ISBN 0123361559
1994
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.