
arxiv: 2606.07558 · v1 · pith:HULTUUIXnew · submitted 2026-05-25 · 💻 cs.CV · cs.AI · cs.DL

Page image classifier fine-tuned on century-spanning archives of scanned documents for further content-specific processing

Pith reviewed 2026-06-29 22:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.DL
keywords historical document classification, image classification, deep learning, archival digitization, page layout analysis, RegNetY, Vision Transformer

The pith

Fine-tuned RegNetY-16GF reaches 99.16 percent accuracy on classifying century-old scanned pages into 11 content categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an image classification system to sort large archives of historical documents by visual type so that downstream tools like OCR or table extraction can be applied selectively. A dataset of more than 48,000 pages from Czech archaeological collections was labeled through four rounds of expert review into an 11-class scheme covering text, tables, graphics and other page layouts. Multiple deep-learning models were fine-tuned and compared against a hand-crafted feature baseline. The strongest models exceed 99 percent top-1 accuracy on held-out test pages and assign consistent labels to an additional 649,000 unlabeled pages with over 90 percent agreement between independent networks.

Core claim

Fine-tuned convolutional networks and vision transformers classify historical page images into an 11-category taxonomy with top-1 accuracies above 99 percent on a held-out test set; the same models produce mutually consistent labels on 649,508 unlabeled pages at greater than 90 percent inter-model agreement, while a multimodal CLIP variant shows markedly lower agreement with the image-only models on the unlabeled archive.
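
The inter-model agreement figure in this claim can be read as the share of pages on which two models predict the same category. The sketch below illustrates that reading on toy data; the paper's exact definition may differ, and the model names and category labels are placeholders rather than the paper's 11-class scheme.

```python
from itertools import combinations


def agreement_rate(labels_a, labels_b):
    """Fraction of pages on which two models predict the same category."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must cover the same pages")
    if not labels_a:
        raise ValueError("no pages to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def pairwise_agreement(predictions):
    """Map each unordered pair of model names to its agreement rate."""
    return {
        (m1, m2): agreement_rate(predictions[m1], predictions[m2])
        for m1, m2 in combinations(sorted(predictions), 2)
    }


# Toy predictions for five pages (placeholder model names and labels).
toy_predictions = {
    "cnn": ["text", "table", "photo", "text", "table"],
    "vit": ["text", "table", "photo", "text", "photo"],
    "clip": ["text", "photo", "table", "text", "photo"],
}
rates = pairwise_agreement(toy_predictions)
for pair, rate in rates.items():
    print(pair, f"{rate:.0%}")
```

Agreement measured this way says nothing about correctness: two models can agree on a page and both be wrong.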

What carries the argument

Fine-tuned RegNetY-16GF convolutional network performing 11-way classification of page images.

If this is right

  • Large-scale digitization projects can replace manual page sorting with automated classification at near-perfect accuracy.
  • Content-specific pipelines become practical: OCR runs only on text pages, table extraction only on tabular pages.
  • Image-only models maintain higher consistency on unlabeled archival data than multimodal CLIP variants.
  • The released models, annotated dataset, and software enable direct reuse on other historical collections.
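
A minimal sketch of the content-specific routing described in the list above, assuming a simple lookup from predicted category to downstream tool. The category names, pipeline names, and the `manual_review` fallback are hypothetical; the paper's actual 11 categories and tooling are not listed on this page.

```python
from collections import defaultdict

# Hypothetical category-to-pipeline mapping (placeholder names).
ROUTES = {
    "text": "ocr",
    "table": "table_extraction",
    "graphic": "image_archive",
}
DEFAULT_ROUTE = "manual_review"  # assumed fallback for unmapped categories


def route_pages(page_labels, routes=ROUTES, default=DEFAULT_ROUTE):
    """Group page identifiers by the pipeline their predicted label maps to."""
    queues = defaultdict(list)
    for page_id, label in page_labels.items():
        queues[routes.get(label, default)].append(page_id)
    return dict(queues)


queues = route_pages({"p1": "text", "p2": "table", "p3": "text", "p4": "map"})
print(queues)
```

Keeping the mapping in a plain dictionary makes it easy to adjust when a collection uses a different label scheme.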

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fine-tuning recipe could be tested on archives from other languages or time periods to check transferability.
  • High inter-model agreement on unlabeled data suggests the visual features learned are stable across document aging and scanning variations.
  • Deployment could reduce the fraction of pages requiring any human review in future digitization workflows.

Load-bearing premise

The annotated subset, labeled through the four-stage expert process, is representative of the full century-spanning unlabeled archive, so that models trained on it face no distribution shift when applied there.

What would settle it

If experts re-annotated a random sample of several hundred pages from the 649,508 automatically labeled pages and found systematic category mismatches, or if inter-model agreement on that sample fell below 80 percent, the generalization claim would be falsified.
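
The audit described here could be sketched as follows: draw a reproducible random sample of page indices, compare model labels with expert re-annotations, and attach a Wilson score interval to the observed accuracy. The sample size of 300, the seed, and all function names are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def draw_audit_sample(page_ids, sample_size, seed=0):
    """Uniform random sample without replacement, reproducible via the seed."""
    rng = random.Random(seed)
    return rng.sample(list(page_ids), sample_size)


def audit_accuracy(expert_labels, model_labels):
    """Share of audited pages where the model label matches the expert label."""
    pages = list(expert_labels)
    correct = sum(expert_labels[p] == model_labels[p] for p in pages)
    return correct / len(pages)


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Illustrative draw of 300 page indices from an archive of 649,508 pages.
sample = draw_audit_sample(range(649_508), 300, seed=42)
low, high = wilson_interval(0.5, 100)
```

With only a few hundred audited pages the interval stays fairly wide, which is worth keeping in mind when comparing the audit result against a fixed threshold.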

Original abstract

Purpose: Digitization projects in the humanities produce vast, heterogeneous archives of historical documents, making manual sorting impractical at scale. This work addresses the need for an automated system to classify scanned page images based on visual content type - text, tables, and graphics - enabling content-specific downstream processing such as Optical Character Recognition (OCR) or structured data extraction. Methods: An image classification system was developed and evaluated on a dataset of over 48,000 annotated historical page images from century-old Czech archaeological archives, refined through four successive annotation stages with domain-expert review. A Random Forest Classifier baseline was established using hand-crafted image features. Subsequently, deep learning architectures were fine-tuned and compared: Convolutional Neural Networks (EfficientNetV2, RegNetY), Vision and Document Image Transformers (ViT, DiT), and multimodal CLIP models. An 11-category label scheme was designed collaboratively with domain experts and evaluated via five-fold cross-validation. Results: The feature-based baseline achieved approximately 75% accuracy. Fine-tuned CNNs and Transformers substantially outperformed it, with RegNetY-16GF achieving 99.16% and ViT-large 99.12% Top-1 accuracy on the held-out test set. CLIP ViT-B/16 reached 99.14% with optimized text descriptions. Conclusion: Image-only models, particularly RegNetY-16GF, deliver near-perfect classification accuracy and produce consistent labels across 649,508 unlabeled archival pages with over 90% inter-model agreement. Fine-tuned CLIP, despite competitive test-set accuracy, showed under 65% agreement with image-only models on unlabeled data, making it less suitable for deployment. The final models, annotated dataset, and software are publicly available under open-source licenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper develops and evaluates image classifiers for 11-category content classification of historical scanned page images from Czech archaeological archives spanning a century. It establishes a ~75% Random Forest baseline on hand-crafted features, then fine-tunes CNNs (EfficientNetV2, RegNetY), transformers (ViT, DiT), and CLIP models on a 48k-image expert-annotated dataset refined over four annotation stages. RegNetY-16GF reaches 99.16% and ViT-large 99.12% Top-1 accuracy on a held-out test set via five-fold cross-validation; the models are applied to 649k unlabeled pages, where image-only models show >90% inter-model agreement while CLIP shows lower agreement. The annotated data, models, and code are released publicly.

Significance. If the reported test-set accuracies and generalization hold, the work supplies a practical, high-accuracy pipeline for automated triage of large heterogeneous historical archives, directly enabling content-specific downstream processing. The public release of the dataset, trained models, and software constitutes a clear strength for reproducibility and follow-on research in document image analysis.

major comments (2)
  1. [Abstract / Conclusion] Abstract and Conclusion: the claim that the fine-tuned models 'deliver reliable labels' on the 649,508 unlabeled pages rests on 99+% held-out accuracy plus >90% inter-model agreement. Inter-model agreement is only a consistency metric and does not rule out correlated errors under distribution shift; no sampling, expert re-annotation, or ground-truth check on any subset of the unlabeled archive is reported to test transfer from the 48k annotated pages.
  2. [Methods] Methods (five-fold cross-validation description): the manuscript supplies no information on whether the test fold was strictly isolated from all hyperparameter tuning, model selection, and data-augmentation decisions, nor on class-imbalance handling or label-noise mitigation during the four-stage annotation process.
minor comments (1)
  1. [Abstract] Abstract: the baseline accuracy is given only as 'approximately 75%'; reporting the precise figure and the feature set used would improve clarity.
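
To make the isolation concern in major comment 2 concrete, the sketch below shows one way a stratified five-fold split can keep each test fold away from tuning: one fold is reserved for testing, a second for validation, and the remaining three for training. This is an illustrative protocol under that assumption, not the paper's documented procedure, and the toy labels are placeholders.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, balancing classes across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def isolated_splits(folds):
    """Yield (train, validation, test) index lists, one triple per test fold.

    Tuning would use only train and validation; test is scored once at the end.
    """
    k = len(folds)
    for i in range(k):
        val_fold = (i + 1) % k
        train = [idx for j in range(k) if j not in (i, val_fold) for idx in folds[j]]
        yield train, folds[val_fold], folds[i]


toy_labels = ["text"] * 10 + ["table"] * 5
splits = list(isolated_splits(stratified_folds(toy_labels)))
```

Reporting which of these index sets each tuning decision touched is the kind of detail the referee asks the Methods section to state.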

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where the manuscript requires clarification or adjustment.

  1. Referee: [Abstract / Conclusion] Abstract and Conclusion: the claim that the fine-tuned models 'deliver reliable labels' on the 649,508 unlabeled pages rests on 99+% held-out accuracy plus >90% inter-model agreement. Inter-model agreement is only a consistency metric and does not rule out correlated errors under distribution shift; no sampling, expert re-annotation, or ground-truth check on any subset of the unlabeled archive is reported to test transfer from the 48k annotated pages.

    Authors: We agree that inter-model agreement is a consistency measure and does not by itself rule out systematic errors under distribution shift. The manuscript's conclusion uses the phrasing 'produce consistent labels' rather than asserting ground-truth reliability on the unlabeled set. To avoid any overstatement, we will revise the abstract and conclusion to explicitly frame the 649k labels as inferred from high held-out accuracy and cross-model consistency, and we will add a limitations paragraph noting the absence of direct validation on the unlabeled archive. If feasible within the revision timeline, we will also report a small expert-reviewed subsample of the unlabeled pages.
    Revision: yes

  2. Referee: [Methods] Methods (five-fold cross-validation description): the manuscript supplies no information on whether the test fold was strictly isolated from all hyperparameter tuning, model selection, and data-augmentation decisions, nor on class-imbalance handling or label-noise mitigation during the four-stage annotation process.

    Authors: The five-fold cross-validation was performed with the test fold fully isolated: all hyperparameter search, model selection, and augmentation decisions were made exclusively on the training folds of each split. Class imbalance was addressed via class-weighted cross-entropy loss during fine-tuning. Label noise was mitigated through the four-stage expert annotation pipeline, in which each page received independent review by at least two domain experts with final adjudication on disagreements. We will expand the Methods section with these procedural details and a brief description of the annotation stages.
    Revision: yes
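
The simulated response mentions class-weighted cross-entropy without giving a weighting formula. A minimal sketch, assuming inverse-frequency weights and a weighted mean normalised by the sum of weights (the normalisation PyTorch's `CrossEntropyLoss` applies with `reduction='mean'`); the class names and counts are toy values.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count) (assumed scheme)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probabilities, targets, weights):
    """Weighted mean of -log p(target), normalised by the sum of sample weights."""
    total = 0.0
    weight_sum = 0.0
    for probs, target in zip(probabilities, targets):
        w = weights[target]
        total += -w * math.log(probs[target])
        weight_sum += w
    return total / weight_sum


# Toy imbalance: eight "text" pages for every two "table" pages.
weights = inverse_frequency_weights(["text"] * 8 + ["table"] * 2)
uniform = {"text": 0.5, "table": 0.5}
loss = weighted_cross_entropy([uniform, uniform], ["text", "table"], weights)
```

Under this scheme a mistake on a rare class costs more than one on a frequent class, which counteracts the pull toward majority categories.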

Circularity Check

0 steps flagged

No circularity: standard empirical held-out evaluation on annotated archive


The paper performs supervised fine-tuning of image classifiers on a 48k-page annotated dataset (four-stage expert process) and reports Top-1 accuracy on a held-out test split plus inter-model agreement on 649k unlabeled pages. No equations, derivations, or self-citations are load-bearing; the reported numbers are direct empirical measurements from standard train/test splits and consistency checks. No quantity is defined in terms of itself, no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is smuggled in. The work is self-contained against external benchmarks (held-out accuracy) and receives the default non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The performance claim rests on the assumption that expert multi-stage annotation yields reliable ground truth and that the 11-category scheme captures the visual distinctions needed for downstream processing.

axioms (1)
  • Domain assumption: The 11-category label scheme designed collaboratively with domain experts accurately reflects distinct visual content types suitable for routing to OCR or structured extraction.
    All reported accuracies and deployment decisions depend on this labeling being both consistent and meaningful for the target use case.


