Page image classifier fine-tuned on century-spanning archives of scanned documents for further content-specific processing
Pith reviewed 2026-06-29 22:48 UTC · model grok-4.3
The pith
Fine-tuned RegNetY-16GF reaches 99.16 percent accuracy on classifying century-old scanned pages into 11 content categories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuned convolutional networks and vision transformers classify historical page images into an 11-category taxonomy with top-1 accuracies above 99 percent on a held-out test set; the same models produce mutually consistent labels on 649,508 unlabeled pages at greater than 90 percent inter-model agreement, while a multimodal CLIP variant shows markedly lower agreement with the image-only models on the unlabeled archive.
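The agreement figures in this claim can be made concrete with a small sketch. It assumes that inter-model agreement means the share of pages on which two models emit the same label; the source does not define the metric, and the model keys and label names below are hypothetical placeholders rather than the paper's 11 categories.

```python
from itertools import combinations


def pairwise_agreement(predictions):
    """Return {(model_a, model_b): share of pages given identical labels}."""
    result = {}
    for a, b in combinations(sorted(predictions), 2):
        labels_a, labels_b = predictions[a], predictions[b]
        if not labels_a or len(labels_a) != len(labels_b):
            raise ValueError("each model needs one label per page, same page order")
        matches = sum(x == y for x, y in zip(labels_a, labels_b))
        result[(a, b)] = matches / len(labels_a)
    return result


# Toy predictions for five pages (hypothetical labels and model names).
toy_predictions = {
    "regnety": ["TEXT", "TABLE", "PHOTO", "TEXT", "DRAW"],
    "vit": ["TEXT", "TABLE", "PHOTO", "TEXT", "TEXT"],
    "clip": ["TEXT", "PHOTO", "DRAW", "TABLE", "TEXT"],
}
agreement = pairwise_agreement(toy_predictions)
for pair, share in agreement.items():
    print(pair, f"{share:.0%}")
```

Agreement of this kind measures consistency only: two models that share the same blind spot can agree on a wrong label, so a high score is not a substitute for checking against expert labels.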
What carries the argument
Fine-tuned RegNetY-16GF convolutional network performing 11-way classification of page images.
If this is right
- Large-scale digitization projects can replace manual page sorting with automated classification at near-perfect accuracy.
- Content-specific pipelines become practical: OCR runs only on text pages, table extraction only on tabular pages.
- Image-only models maintain higher consistency on unlabeled archival data than multimodal CLIP variants.
- The released models, annotated dataset, and software enable direct reuse on other historical collections.
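The content-specific routing mentioned in the list above could look roughly like the following. This is a minimal sketch under stated assumptions: the label names, pipeline names, and fallback queue are hypothetical and are not taken from the released software.

```python
from collections import defaultdict

# Hypothetical mapping from predicted label to downstream pipeline.
ROUTES = {"TEXT": "ocr", "TABLE": "table_extraction"}
# Hypothetical queue for labels that have no dedicated pipeline.
FALLBACK = "no_text_processing"


def route_pages(predictions):
    """Group page IDs by the pipeline that their predicted label selects."""
    queues = defaultdict(list)
    for page_id, label in predictions:
        queues[ROUTES.get(label, FALLBACK)].append(page_id)
    return dict(queues)


predictions = [
    ("p001", "TEXT"),
    ("p002", "TABLE"),
    ("p003", "PHOTO"),
    ("p004", "TEXT"),
]
queues = route_pages(predictions)
print(queues)
```

Keeping the mapping in one dictionary means a new category only needs a new entry, and anything unmapped falls into a visible fallback queue instead of being silently sent to OCR.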
Where Pith is reading between the lines
- The same fine-tuning recipe could be tested on archives from other languages or time periods to check transferability.
- High inter-model agreement on unlabeled data suggests the visual features learned are stable across document aging and scanning variations.
- Deployment could reduce the fraction of pages requiring any human review in future digitization workflows.
Load-bearing premise
The labels produced by the four-stage expert process on the annotated subset remain representative and free of distribution shift when applied to the full century-spanning unlabeled archive.
What would settle it
The generalization claim would be falsified if experts re-annotated a random sample of several hundred pages from the 649,508 automatically labeled set and found systematic category mismatches, or if inter-model agreement on that sample fell below 80 percent.
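Such an audit could be planned along the following lines. The sketch draws a reproducible random sample of page indices and puts a Wilson score interval around the share of labels that experts confirm. The sample size of 500, the seed, and the audit outcome of 490 confirmed labels are illustrative assumptions, not results from the paper.

```python
import math
import random


def draw_audit_sample(page_ids, sample_size, seed=0):
    """Draw a reproducible simple random sample without replacement."""
    return random.Random(seed).sample(page_ids, sample_size)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Stand-in integer IDs for the 649,508 automatically labeled pages.
page_ids = range(649_508)
audit_sample = draw_audit_sample(page_ids, sample_size=500)

# Hypothetical outcome: experts confirm 490 of the 500 sampled labels.
low, high = wilson_interval(successes=490, n=500)
print(f"confirmed share 0.980, interval [{low:.3f}, {high:.3f}]")
```

A per-category breakdown of the same sample would be needed to detect the systematic mismatches mentioned above, since an overall rate can hide a category that is consistently wrong.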
Original abstract
Purpose: Digitization projects in the humanities produce vast, heterogeneous archives of historical documents, making manual sorting impractical at scale. This work addresses the need for an automated system to classify scanned page images based on visual content type - text, tables, and graphics - enabling content-specific downstream processing such as Optical Character Recognition (OCR) or structured data extraction.
Methods: An image classification system was developed and evaluated on a dataset of over 48,000 annotated historical page images from century-old Czech archaeological archives, refined through four successive annotation stages with domain-expert review. A Random Forest Classifier baseline was established using hand-crafted image features. Subsequently, deep learning architectures were fine-tuned and compared: Convolutional Neural Networks (EfficientNetV2, RegNetY), Vision and Document Image Transformers (ViT, DiT), and multimodal CLIP models. An 11-category label scheme was designed collaboratively with domain experts and evaluated via five-fold cross-validation.
Results: The feature-based baseline achieved approximately 75% accuracy. Fine-tuned CNNs and Transformers substantially outperformed it, with RegNetY-16GF achieving 99.16% and ViT-large 99.12% Top-1 accuracy on the held-out test set. CLIP ViT-B/16 reached 99.14% with optimized text descriptions.
Conclusion: Image-only models, particularly RegNetY-16GF, deliver near-perfect classification accuracy and produce consistent labels across 649,508 unlabeled archival pages with over 90% inter-model agreement. Fine-tuned CLIP, despite competitive test-set accuracy, showed under 65% agreement with image-only models on unlabeled data, making it less suitable for deployment. The final models, annotated dataset, and software are publicly available under open-source licenses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops and evaluates image classifiers for 11-category content classification of historical scanned page images from Czech archaeological archives spanning a century. It establishes a ~75% Random Forest baseline on hand-crafted features, then fine-tunes CNNs (EfficientNetV2, RegNetY), transformers (ViT, DiT), and CLIP models on a 48k-image expert-annotated dataset refined over four annotation stages. RegNetY-16GF reaches 99.16% and ViT-large 99.12% Top-1 accuracy on a held-out test set via five-fold cross-validation; the models are applied to 649k unlabeled pages, where image-only models show >90% inter-model agreement while CLIP shows lower agreement. The annotated data, models, and code are released publicly.
Significance. If the reported test-set accuracies and generalization hold, the work supplies a practical, high-accuracy pipeline for automated triage of large heterogeneous historical archives, directly enabling content-specific downstream processing. The public release of the dataset, trained models, and software constitutes a clear strength for reproducibility and follow-on research in document image analysis.
major comments (2)
- [Abstract / Conclusion] Abstract and Conclusion: the claim that the fine-tuned models 'deliver reliable labels' on the 649,508 unlabeled pages rests on 99+% held-out accuracy plus >90% inter-model agreement. Inter-model agreement is only a consistency metric and does not rule out correlated errors under distribution shift; no sampling, expert re-annotation, or ground-truth check on any subset of the unlabeled archive is reported to test transfer from the 48k annotated pages.
- [Methods] Methods (five-fold cross-validation description): the manuscript supplies no information on whether the test fold was strictly isolated from all hyperparameter tuning, model selection, and data-augmentation decisions, nor on class-imbalance handling or label-noise mitigation during the four-stage annotation process.
minor comments (1)
- [Abstract] Abstract: the baseline accuracy is given only as 'approximately 75%'; reporting the precise figure and the feature set used would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where the manuscript requires clarification or adjustment.
- Referee: [Abstract / Conclusion] Abstract and Conclusion: the claim that the fine-tuned models 'deliver reliable labels' on the 649,508 unlabeled pages rests on 99+% held-out accuracy plus >90% inter-model agreement. Inter-model agreement is only a consistency metric and does not rule out correlated errors under distribution shift; no sampling, expert re-annotation, or ground-truth check on any subset of the unlabeled archive is reported to test transfer from the 48k annotated pages.
  Authors: We agree that inter-model agreement is a consistency measure and does not by itself rule out systematic errors under distribution shift. The manuscript's conclusion uses the phrasing 'produce consistent labels' rather than asserting ground-truth reliability on the unlabeled set. To avoid any overstatement, we will revise the abstract and conclusion to explicitly frame the 649k labels as inferred from high held-out accuracy and cross-model consistency, and we will add a limitations paragraph noting the absence of direct validation on the unlabeled archive. If feasible within the revision timeline, we will also report a small expert-reviewed subsample of the unlabeled pages.
  revision: yes
- Referee: [Methods] Methods (five-fold cross-validation description): the manuscript supplies no information on whether the test fold was strictly isolated from all hyperparameter tuning, model selection, and data-augmentation decisions, nor on class-imbalance handling or label-noise mitigation during the four-stage annotation process.
  Authors: The five-fold cross-validation was performed with the test fold fully isolated: all hyperparameter search, model selection, and augmentation decisions were made exclusively on the training folds of each split. Class imbalance was addressed via class-weighted cross-entropy loss during fine-tuning. Label noise was mitigated through the four-stage expert annotation pipeline, in which each page received independent review by at least two domain experts with final adjudication on disagreements. We will expand the Methods section with these procedural details and a brief description of the annotation stages.
  revision: yes
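The class weighting and fold construction described in the second response can be illustrated with a small standard-library sketch. It is an assumption-laden illustration, not the authors' code: the rebuttal does not say how the weights were derived, so the sketch uses the common "balanced" heuristic n / (k * n_c), and the round-robin stratified split, label names, and class counts are all hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_cross_entropy(probs, target, weights):
    """Loss for one page: -w[target] * log(p[target])."""
    return -weights[target] * math.log(probs[target])


def stratified_folds(labels, k=5):
    """Assign each sample index to a fold, round-robin within each class."""
    seen = Counter()
    fold_of = []
    for label in labels:
        fold_of.append(seen[label] % k)
        seen[label] += 1
    return fold_of


# Hypothetical imbalanced label set: 80 / 15 / 5 pages per class.
labels = ["TEXT"] * 80 + ["TABLE"] * 15 + ["PHOTO"] * 5
weights = balanced_class_weights(labels)
folds = stratified_folds(labels, k=5)
fold_sizes = Counter(folds)

# The same predicted probability costs more when the true class is rare.
loss_rare = weighted_cross_entropy({"PHOTO": 0.5}, "PHOTO", weights)
loss_common = weighted_cross_entropy({"TEXT": 0.5}, "TEXT", weights)
print(weights, dict(fold_sizes), loss_rare > loss_common)
```

In a PyTorch setup, weights like these would typically be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`. To keep the test fold isolated as the authors describe, the weights would be recomputed from the training folds of each split only.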
Circularity Check
No circularity: standard empirical held-out evaluation on annotated archive
The paper performs supervised fine-tuning of image classifiers on a 48k-page annotated dataset (four-stage expert process) and reports Top-1 accuracy on a held-out test split plus inter-model agreement on 649k unlabeled pages. No equations, derivations, or self-citations are load-bearing; the reported numbers are direct empirical measurements from standard train/test splits and consistency checks. No quantity is defined in terms of itself, no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is smuggled in. The work is self-contained against external benchmarks (held-out accuracy) and receives the default non-circular finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 11-category label scheme designed collaboratively with domain experts accurately reflects distinct visual content types suitable for routing to OCR or structured extraction.
Reference graph
Works this paper leans on
- [1] Nikolaidou, K., Seuret, M., Mokayed, H., Liwicki, M.: A survey of historical document image datasets. International Journal on Document Analysis and Recognition (IJDAR) 25(4), 305–338 (2022). https://doi.org/10.1007/s10032-021-00390-6
- [2] Zhong, X., Tang, J., Yepes, A.J.: PubLayNet: Largest dataset ever for document layout analysis. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1015–1022 (2019). IEEE. https://doi.org/10.1109/ICDAR.2019.00166
- [3] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1192–1200 (2020). https://doi.org/10.1145/3394486.3403172
- [4] Lutsai, K., Straňák, P.: Page image classification for content-specific data processing (2026). https://arxiv.org/abs/2507.21114
- [5] Lewis, D.D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard, J.: Building a test collection for complex document information processing. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 665–666 (2006). https://doi.org/10.1145/1148170.1148307
- [6] Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 991–995 (2015). IEEE. https://doi.org/10.1109/ICDAR.2015.7333910
- [7] Liu, L., Wang, Z., Qiu, T., Chen, Q., Lu, Y., Suen, C.Y.: Document image classification: Progress over two decades. Neurocomputing 453, 223–240 (2021). https://doi.org/10.1016/j.neucom.2021.05.003
- [8] Tan, M., Le, Q.V.: EfficientNetV2: Smaller models and faster training. In: International Conference on Machine Learning (ICML), pp. 10096–10106 (2021). PMLR
- [9] Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning (ICML), pp. 6105–6114 (2019). PMLR
- [10] Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P.: Designing network design spaces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10428–10436 (2020). https://doi.org/10.1109/CVPR42600.2020.01044
- [11] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021). https://openreview.net/forum?id=YicbFdNTTy
- [12] Beyer, L., Zhai, X., Kolesnikov, A.: Better plain ViT baselines for ImageNet-1k. arXiv preprint arXiv:2205.01580 (2022)
- [13] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning (ICML), pp. 10347–10357 (2021). PMLR
- [14] Ridnik, T., Ben-Baruch, E., Noy, A., Zelnik-Manor, L.: ImageNet-21K pretraining for the masses. arXiv preprint arXiv:2104.10972 (2021)
- [15] Li, J., Xu, Y., Lv, T., Cui, L., Zhang, C., Wei, F.: DiT: Self-supervised pre-training for document image transformer. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 3530–3539 (2022). https://doi.org/10.1145/3503161.3547911
- [16] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML), pp. 8748–8763 (2021). PMLR
- [17] Xu, H., Xie, S., Tan, X.E., Huang, P.-Y., Howes, R., Sharma, V., Li, S.-W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying CLIP data. arXiv preprint arXiv:2309.16671 (2023)
- [18] Biswas, B., Bhattacharya, U., Chaudhuri, B.B.: Document image skew detection and correction: A survey. IET Image Processing 17(14), 3985–4006 (2023). https://doi.org/10.1049/ipr2.12876
- [19] Smith, R.: An overview of the Tesseract OCR engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 629–633 (2007). IEEE. https://doi.org/10.1109/ICDAR.2007.4376991
- [20] Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324
- [21] Yousefi, J.: Image binarization using Otsu thresholding algorithm. Ontario, Canada: University of Guelph 10, 9 (2011)
- [22] Hu, M.-K.: Visual pattern recognition by moment invariants. IRE Transactions on Information Theory 8(2), 179–187 (1962). https://doi.org/10.1109/TIT.1962.1057692
- [23] Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics (6), 610–621 (1973). https://doi.org/10.1109/TSMC.1973.4309314
- [24] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 32 (2019)
- [25] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: HuggingFace’s Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, pp. 38–45 (2020). https://doi.or...
- [26] Lutsai, K., Stranak, P., Novak, D., Krivankova, D.: ATRIUM’s Page Classifier: Classification of Historical Page Images Using Fine-tuned ViT. https://github.com/ufal/atrium-page-classification
- [27] Lutsai, K., Krivankova, D.: Annotated Page Images from the (archaeological) Historical Archive. http://hdl.handle.net/20.500.12800/1-5959
Appendix A CLIP Category Descriptions
The full suite of eight category description sets used for CLIP fine-tuning and zero-shot evaluation is reproduced in the accompanying thesis [4]. Table 8 summarizes all set...
[Table residue: “Rev.” denotes the revision fine-tuned to a specific label set of text features; column headers Table, Label, Summary; first row: init, 9, Provides the full “initial ...]