Page image classifier fine-tuned on century-spanning archives of scanned documents for further content-specific processing

Dana Křivánková; David Novák; Kateryna Lutsai; Pavel Straňák

arXiv: 2606.07558 · v1 · pith:HULTUUIXnew · submitted 2026-05-25 · cs.CV · cs.AI · cs.DL

Pith reviewed 2026-06-29 22:48 UTC · model grok-4.3

keywords: historical document classification, image classification, deep learning, archival digitization, page layout analysis, RegNetY, Vision Transformer

The pith

A fine-tuned RegNetY-16GF reaches 99.16 percent Top-1 accuracy on the held-out test set when classifying scanned pages from century-old archives into 11 content categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an image classification system to sort large archives of historical documents by visual type so that downstream tools like OCR or table extraction can be applied selectively. A dataset of more than 48,000 pages from Czech archaeological collections was labeled through four rounds of expert review into an 11-class scheme covering text, tables, graphics and other page layouts. Multiple deep-learning models were fine-tuned and compared against a hand-crafted feature baseline. The strongest models exceed 99 percent top-1 accuracy on held-out test pages and assign consistent labels to an additional 649,000 unlabeled pages with over 90 percent agreement between independent networks.

Core claim

Fine-tuned convolutional networks and vision transformers classify historical page images into an 11-category taxonomy with top-1 accuracies above 99 percent on a held-out test set; the same models produce mutually consistent labels on 649,508 unlabeled pages at greater than 90 percent inter-model agreement, while a multimodal CLIP variant shows markedly lower agreement with the image-only models on the unlabeled archive.
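
To make the consistency metric behind this claim concrete, the sketch below computes pairwise agreement as the share of pages on which two models assign the same label. It is a minimal illustration, not the authors' code: the function name, the model names, and the toy labels are all assumptions made for the example.

```python
from itertools import combinations


def pairwise_agreement(predictions):
    """Return, for each pair of models, the fraction of pages labeled identically.

    `predictions` maps a model name to its list of predicted labels,
    one label per page, with pages in the same order for every model.
    """
    lengths = {len(labels) for labels in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all models must label the same pages")
    n_pages = lengths.pop()
    if n_pages == 0:
        raise ValueError("no pages to compare")

    agreement = {}
    for (name_a, labels_a), (name_b, labels_b) in combinations(predictions.items(), 2):
        same = sum(a == b for a, b in zip(labels_a, labels_b))
        agreement[(name_a, name_b)] = same / n_pages
    return agreement


# Toy predictions for five pages; model and label names are placeholders.
toy = {
    "regnety": ["text", "table", "text", "photo", "text"],
    "vit": ["text", "table", "text", "photo", "table"],
    "clip": ["text", "text", "photo", "photo", "table"],
}
for pair, share in pairwise_agreement(toy).items():
    print(pair, f"{share:.0%}")
```

Note that the metric compares models only with each other. It never consults a ground-truth label, which is why it measures consistency rather than correctness.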

What carries the argument

Fine-tuned RegNetY-16GF convolutional network performing 11-way classification of page images.

If this is right

Large-scale digitization projects can replace manual page sorting with automated classification at near-perfect accuracy.
Content-specific pipelines become practical: OCR runs only on text pages, table extraction only on tabular pages.
Image-only models maintain higher consistency on unlabeled archival data than multimodal CLIP variants.
The released models, annotated dataset, and software enable direct reuse on other historical collections.
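
The selective-routing idea in the list above can be sketched as a lookup from predicted category to downstream processor. Everything named here is hypothetical: the category names, the processor names, and the 0.9 confidence threshold are placeholders, since the paper's actual 11 labels are not listed on this page.

```python
from collections import defaultdict

# Hypothetical mapping from predicted category to downstream processor.
# These labels and processor names are illustrative placeholders only.
ROUTES = {
    "text": "ocr",
    "table": "table_extraction",
    "graphic": "image_archive",
}
FALLBACK = "manual_review"


def route_pages(predicted, min_confidence=0.9):
    """Group page IDs by downstream processor.

    `predicted` is an iterable of (page_id, label, confidence) tuples.
    Pages with an unmapped label or low confidence go to manual review.
    """
    queues = defaultdict(list)
    for page_id, label, confidence in predicted:
        if confidence < min_confidence:
            queues[FALLBACK].append(page_id)
        else:
            queues[ROUTES.get(label, FALLBACK)].append(page_id)
    return dict(queues)


pages = [
    ("p001", "text", 0.99),
    ("p002", "table", 0.97),
    ("p003", "text", 0.55),  # low confidence, sent to manual review
    ("p004", "map", 0.98),   # label with no configured route
]
print(route_pages(pages))
```

Keeping a fallback queue is a design choice of this sketch, so that a misrouted page costs a human glance rather than a silent OCR failure.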

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fine-tuning recipe could be tested on archives from other languages or time periods to check transferability.
High inter-model agreement on unlabeled data suggests the visual features learned are stable across document aging and scanning variations.
Deployment could reduce the fraction of pages requiring any human review in future digitization workflows.

Load-bearing premise

The labels produced by the four-stage expert process on the annotated subset remain representative and free of distribution shift when applied to the full century-spanning unlabeled archive.

What would settle it

The generalization claim would be falsified if experts re-annotated a random sample of several hundred pages from the 649,508 automatically labeled pages and found systematic category mismatches, or if inter-model agreement on such a sample fell below 80 percent.
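
A minimal sketch of such an audit follows, assuming a simple random sample and a Wilson score interval for the observed label accuracy. The sample size of 500, the count of 480 confirmed labels, and the integer page IDs are invented for illustration; none of them comes from the paper.

```python
import math
import random


def draw_audit_sample(page_ids, sample_size, seed=0):
    """Draw a reproducible simple random sample of page IDs for expert review."""
    rng = random.Random(seed)
    return rng.sample(list(page_ids), sample_size)


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Stand-in page IDs: integers 0..649,507 for the 649,508 unlabeled pages.
sample = draw_audit_sample(range(649_508), sample_size=500)

# Hypothetical outcome: experts confirm 480 of the 500 sampled labels.
low, high = wilson_interval(successes=480, trials=500)
print(f"observed accuracy {480 / 500:.1%}, interval {low:.3f} to {high:.3f}")
```

The interval matters more than the point estimate here: a sample of a few hundred pages can bound overall accuracy reasonably well, but it says little about rare categories unless the sample is stratified by predicted class.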

Original abstract

Purpose: Digitization projects in the humanities produce vast, heterogeneous archives of historical documents, making manual sorting impractical at scale. This work addresses the need for an automated system to classify scanned page images based on visual content type - text, tables, and graphics - enabling content-specific downstream processing such as Optical Character Recognition (OCR) or structured data extraction. Methods: An image classification system was developed and evaluated on a dataset of over 48,000 annotated historical page images from century-old Czech archaeological archives, refined through four successive annotation stages with domain-expert review. A Random Forest Classifier baseline was established using hand-crafted image features. Subsequently, deep learning architectures were fine-tuned and compared: Convolutional Neural Networks (EfficientNetV2, RegNetY), Vision and Document Image Transformers (ViT, DiT), and multimodal CLIP models. An 11-category label scheme was designed collaboratively with domain experts and evaluated via five-fold cross-validation. Results: The feature-based baseline achieved approximately 75% accuracy. Fine-tuned CNNs and Transformers substantially outperformed it, with RegNetY-16GF achieving 99.16% and ViT-large 99.12% Top-1 accuracy on the held-out test set. CLIP ViT-B/16 reached 99.14% with optimized text descriptions. Conclusion: Image-only models, particularly RegNetY-16GF, deliver near-perfect classification accuracy and produce consistent labels across 649,508 unlabeled archival pages with over 90% inter-model agreement. Fine-tuned CLIP, despite competitive test-set accuracy, showed under 65% agreement with image-only models on unlabeled data, making it less suitable for deployment. The final models, annotated dataset, and software are publicly available under open-source licenses.
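
The abstract mentions five-fold cross-validation without spelling out the split protocol. The sketch below shows one common way to build such folds, a stratified split that spreads every class evenly across folds, along with the Top-1 accuracy metric. It is an assumption about how this could be done, not a description of the authors' procedure, and the class names and counts are toy values.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample index to a fold so that every class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def top1_accuracy(true_labels, predicted_labels):
    """Share of samples whose single predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy data: three placeholder classes with unequal sizes.
labels = ["text"] * 50 + ["table"] * 30 + ["graphic"] * 20
folds = stratified_folds(labels)

for k, test_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    # A real run would fine-tune on train_idx and predict on test_idx here.
    assert not set(train_idx) & set(test_idx)
    print(f"fold {k}: {len(train_idx)} train, {len(test_idx)} test")
```

Stratification is worth the extra lines when classes are imbalanced, because a plain random split can leave a rare category nearly absent from one fold.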

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper delivers a new annotated dataset of Czech historical pages plus strong benchmark numbers from fine-tuned models, with full public release, but the extension to 649k unlabeled pages rests only on inter-model agreement.

The punchline is that they built and released a solid annotated dataset of Czech historical pages and got near-perfect accuracy with off-the-shelf fine-tuned models. That's the main value.

They collected over 48,000 pages from century-old archives, went through four rounds of expert annotation for 11 categories like text, tables, graphics. They compare a Random Forest baseline at 75% against CNNs and transformers, with RegNetY-16GF at 99.16% and ViT-large at 99.12% on held-out test via five-fold cross-validation. They also release the models and apply them to 649k unlabeled pages, finding over 90% agreement between the top image models. CLIP does well on test but disagrees more on unlabeled. Public release of data, models, and software is a real plus for reproducibility.

The soft spot is how they justify using these models on the big unlabeled set. The high test accuracy and inter-model agreement are reported, but agreement between similar models doesn't rule out shared mistakes if the unlabeled pages have different visual characteristics due to the century span or scanning variations. No additional expert checks on samples from the unlabeled set are described to confirm the labels hold up. The abstract also skips details on class imbalance or potential label noise in the annotation process.
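
The shared-mistakes worry can be made concrete with a small simulation. In this sketch, two classifiers fail together on a fraction of pages (standing in for a page style unseen during training) and otherwise err independently. All rates are invented for illustration and say nothing about the paper's actual models.

```python
import random


def wrong_label(rng, truth, n_classes):
    """Pick a label uniformly from the classes other than `truth`."""
    return (truth + 1 + rng.randrange(n_classes - 1)) % n_classes


def simulate(n_pages=100_000, shared_error=0.15, own_error=0.02, n_classes=11, seed=0):
    """Return (agreement, accuracy_a, accuracy_b) for two simulated classifiers."""
    rng = random.Random(seed)
    agree = correct_a = correct_b = 0
    for _ in range(n_pages):
        truth = rng.randrange(n_classes)
        if rng.random() < shared_error:
            # Shared failure mode: both models make the same mistake.
            pred_a = pred_b = wrong_label(rng, truth, n_classes)
        else:
            # Otherwise each model errs on its own, independently.
            pred_a = wrong_label(rng, truth, n_classes) if rng.random() < own_error else truth
            pred_b = wrong_label(rng, truth, n_classes) if rng.random() < own_error else truth
        agree += pred_a == pred_b
        correct_a += pred_a == truth
        correct_b += pred_b == truth
    return agree / n_pages, correct_a / n_pages, correct_b / n_pages


agreement, acc_a, acc_b = simulate()
print(f"agreement {agreement:.1%}, accuracy A {acc_a:.1%}, accuracy B {acc_b:.1%}")
```

Under these made-up rates, agreement stays well above 90 percent while each model's accuracy sits far lower, which is exactly the gap an expert-checked sample would expose and an agreement score cannot.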

This work is aimed at digital humanities teams handling large scanned archives who want automated sorting before OCR or extraction. Someone in that area would find the dataset and the performance numbers practical. It shows clear thinking in the setup and baseline, so it deserves peer review to check the methods section and any error analysis.

I'd say send it for review; the core contribution is the dataset and the benchmark, which looks solid enough to warrant referee time.

Referee Report

2 major / 1 minor

Summary. The paper develops and evaluates image classifiers for 11-category content classification of historical scanned page images from Czech archaeological archives spanning a century. It establishes a ~75% Random Forest baseline on hand-crafted features, then fine-tunes CNNs (EfficientNetV2, RegNetY), transformers (ViT, DiT), and CLIP models on a 48k-image expert-annotated dataset refined over four annotation stages. RegNetY-16GF reaches 99.16% and ViT-large 99.12% Top-1 accuracy on a held-out test set via five-fold cross-validation; the models are applied to 649k unlabeled pages, where image-only models show >90% inter-model agreement while CLIP shows lower agreement. The annotated data, models, and code are released publicly.

Significance. If the reported test-set accuracies and generalization hold, the work supplies a practical, high-accuracy pipeline for automated triage of large heterogeneous historical archives, directly enabling content-specific downstream processing. The public release of the dataset, trained models, and software constitutes a clear strength for reproducibility and follow-on research in document image analysis.

major comments (2)

[Abstract / Conclusion] Abstract and Conclusion: the claim that the fine-tuned models 'deliver reliable labels' on the 649,508 unlabeled pages rests on 99+% held-out accuracy plus >90% inter-model agreement. Inter-model agreement is only a consistency metric and does not rule out correlated errors under distribution shift; no sampling, expert re-annotation, or ground-truth check on any subset of the unlabeled archive is reported to test transfer from the 48k annotated pages.
[Methods] Methods (five-fold cross-validation description): the manuscript supplies no information on whether the test fold was strictly isolated from all hyperparameter tuning, model selection, and data-augmentation decisions, nor on class-imbalance handling or label-noise mitigation during the four-stage annotation process.

minor comments (1)

[Abstract] Abstract: the baseline accuracy is given only as 'approximately 75%'; reporting the precise figure and the feature set used would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where the manuscript requires clarification or adjustment.

Referee: [Abstract / Conclusion] Abstract and Conclusion: the claim that the fine-tuned models 'deliver reliable labels' on the 649,508 unlabeled pages rests on 99+% held-out accuracy plus >90% inter-model agreement. Inter-model agreement is only a consistency metric and does not rule out correlated errors under distribution shift; no sampling, expert re-annotation, or ground-truth check on any subset of the unlabeled archive is reported to test transfer from the 48k annotated pages.

Authors: We agree that inter-model agreement is a consistency measure and does not by itself rule out systematic errors under distribution shift. The manuscript's conclusion uses the phrasing 'produce consistent labels' rather than asserting ground-truth reliability on the unlabeled set. To avoid any overstatement, we will revise the abstract and conclusion to explicitly frame the 649k labels as inferred from high held-out accuracy and cross-model consistency, and we will add a limitations paragraph noting the absence of direct validation on the unlabeled archive. If feasible within the revision timeline, we will also report a small expert-reviewed subsample of the unlabeled pages.

Revision: yes

Referee: [Methods] Methods (five-fold cross-validation description): the manuscript supplies no information on whether the test fold was strictly isolated from all hyperparameter tuning, model selection, and data-augmentation decisions, nor on class-imbalance handling or label-noise mitigation during the four-stage annotation process.

Authors: The five-fold cross-validation was performed with the test fold fully isolated: all hyperparameter search, model selection, and augmentation decisions were made exclusively on the training folds of each split. Class imbalance was addressed via class-weighted cross-entropy loss during fine-tuning. Label noise was mitigated through the four-stage expert annotation pipeline, in which each page received independent review by at least two domain experts with final adjudication on disagreements. We will expand the Methods section with these procedural details and a brief description of the annotation stages.

Revision: yes
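
For readers unfamiliar with the class-weighted cross-entropy mentioned in this response, here is a minimal pure-Python sketch. The rebuttal does not say how the weights were chosen, so the inverse-frequency scheme below (n_samples / (n_classes * class_count), the "balanced" heuristic) is an assumption, as are the class names and probabilities.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_cross_entropy(probabilities, true_labels, weights):
    """Weighted mean of -log p(true class), normalised by the sum of weights.

    `probabilities` is a list of dicts mapping class name to predicted probability.
    """
    total = sum(-weights[t] * math.log(p[t]) for p, t in zip(probabilities, true_labels))
    return total / sum(weights[t] for t in true_labels)


# Toy imbalanced label set with placeholder class names.
labels = ["text"] * 8 + ["table"] * 2
weights = inverse_frequency_weights(labels)
print(weights)

# The model is confident on the majority class and weak on the minority class.
probs = [{"text": 0.9, "table": 0.1}] * 8 + [{"text": 0.6, "table": 0.4}] * 2
uniform = {"text": 1.0, "table": 1.0}
print("unweighted:", round(weighted_cross_entropy(probs, labels, uniform), 4))
print("weighted:  ", round(weighted_cross_entropy(probs, labels, weights), 4))
```

With these toy numbers the weighted loss is noticeably higher than the unweighted one, because the poorly predicted minority class now counts as much as the majority class.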

Circularity Check

0 steps flagged

No circularity: standard empirical held-out evaluation on annotated archive

The paper performs supervised fine-tuning of image classifiers on a 48k-page annotated dataset (four-stage expert process) and reports Top-1 accuracy on a held-out test split plus inter-model agreement on 649k unlabeled pages. No equations, derivations, or self-citations are load-bearing; the reported numbers are direct empirical measurements from standard train/test splits and consistency checks. No quantity is defined in terms of itself, no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is smuggled in. The work is self-contained against external benchmarks (held-out accuracy) and receives the default non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The performance claim rests on the assumption that expert multi-stage annotation yields reliable ground truth and that the 11-category scheme captures the visual distinctions needed for downstream processing.

axioms (1)

Domain assumption: The 11-category label scheme designed collaboratively with domain experts accurately reflects distinct visual content types suitable for routing to OCR or structured extraction.
All reported accuracies and deployment decisions depend on this labeling being both consistent and meaningful for the target use case.

pith-pipeline@v0.9.1-grok · 5882 in / 1383 out tokens · 17727 ms · 2026-06-29T22:48:18.304748+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 15 canonical work pages · 2 internal anchors

[1] Nikolaidou, K., Seuret, M., Mokayed, H., Liwicki, M.: A survey of historical document image datasets. International Journal on Document Analysis and Recognition (IJDAR) 25(4), 305–338 (2022). https://doi.org/10.1007/s10032-021-00390-6

[2] Zhong, X., Tang, J., Yepes, A.J.: PubLayNet: Largest dataset ever for document layout analysis. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1015–1022 (2019). IEEE. https://doi.org/10.1109/ICDAR.2019.00166

[3] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1192–1200 (2020). https://doi.org/10.1145/3394486.3403172

[4] Lutsai, K., Straňák, P.: Page image classification for content-specific data processing (2026). https://arxiv.org/abs/2507.21114

[5] Lewis, D.D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard, J.: Building a test collection for complex document information processing. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 665–666 (2006). https://doi.org/10.1145/1148170.1148307

[6] Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 991–995 (2015). IEEE. https://doi.org/10.1109/ICDAR.2015.7333910

[7] Liu, L., Wang, Z., Qiu, T., Chen, Q., Lu, Y., Suen, C.Y.: Document image classification: Progress over two decades. Neurocomputing 453, 223–240 (2021). https://doi.org/10.1016/j.neucom.2021.05.003

[8] Tan, M., Le, Q.V.: EfficientNetV2: Smaller models and faster training. In: International Conference on Machine Learning (ICML), pp. 10096–10106 (2021). PMLR

[9] Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning (ICML), pp. 6105–6114 (2019). PMLR

[10] Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P.: Designing network design spaces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10428–10436 (2020). https://doi.org/10.1109/CVPR42600.2020.01044

[11] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021). https://openreview.net/forum?id=YicbFdNTTy

[12] Beyer, L., Zhai, X., Kolesnikov, A.: Better plain ViT baselines for ImageNet-1k. arXiv preprint arXiv:2205.01580 (2022)

[13] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning (ICML), pp. 10347–10357 (2021). PMLR

[14] Ridnik, T., Ben-Baruch, E., Noy, A., Zelnik-Manor, L.: ImageNet-21K pretraining for the masses. arXiv preprint arXiv:2104.10972 (2021)

[15] Li, J., Xu, Y., Lv, T., Cui, L., Zhang, C., Wei, F.: DiT: Self-supervised pre-training for document image transformer. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 3530–3539 (2022). https://doi.org/10.1145/3503161.3547911

[16] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML), pp. 8748–8763 (2021). PMLR

[17] Xu, H., Xie, S., Tan, X.E., Huang, P.-Y., Howes, R., Sharma, V., Li, S.-W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying CLIP data. arXiv preprint arXiv:2309.16671 (2023)

[18] Biswas, B., Bhattacharya, U., Chaudhuri, B.B.: Document image skew detection and correction: A survey. IET Image Processing 17(14), 3985–4006 (2023). https://doi.org/10.1049/ipr2.12876

[19] Smith, R.: An overview of the Tesseract OCR engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 629–633 (2007). IEEE. https://doi.org/10.1109/ICDAR.2007.4376991

[20] Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324

[21] Yousefi, J.: Image binarization using Otsu thresholding algorithm. Ontario, Canada: University of Guelph 10, 9 (2011)

[22] Hu, M.-K.: Visual pattern recognition by moment invariants. IRE Transactions on Information Theory 8(2), 179–187 (1962). https://doi.org/10.1109/TIT.1962.1057692

[23] Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics (6), 610–621 (1973). https://doi.org/10.1109/TSMC.1973.4309314

[24] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 32 (2019)

[25] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: HuggingFace's Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, pp. 38–45 (2020). doi:10.18653/v1/2020.emnlp-demos.6

[26] Lutsai, K., Stranak, P., Novak, D., Krivankova, D.: ATRIUM's Page Classifier: Classification of Historical Page Images Using Fine-tuned ViT. https://github.com/ufal/atrium-page-classification

[27] Lutsai, K., Krivankova, D.: Annotated Page Images from the (archaeological) Historical Archive. http://hdl.handle.net/20.500.12800/1-5959

Appendix A: CLIP Category Descriptions. The full suite of eight category description sets used for CLIP fine-tuning and zero-shot evaluation is reproduced in the accompanying thesis [4]. Table 8 summarizes all set...
