Bounding Box Label Propagation for Re-Annotation of Document Layout Analysis Datasets

Alexander Lück; Christian Schön; Didier Stricker; Nick Jochum; René Schuster; Tobias Alt-Veit

arxiv: 2606.17644 · v1 · pith:4IBHLXR7new · submitted 2026-06-16 · 💻 cs.CV · cs.AI

Nick Jochum, Tobias Alt-Veit, Christian Schön, Alexander Lück, René Schuster, Didier Stricker

Pith reviewed 2026-06-27 01:53 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords bounding box label propagation · document layout analysis · semi-supervised learning · object detection · pseudo-labelling · re-annotation · D4LA dataset

The pith

Bounding Box Label Propagation re-annotates document layout datasets by propagating class labels from 10% labelled data to reach 81.6% of fully supervised mAP.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to cut re-annotation costs in growing document datasets by manually labelling only a small subset and then automatically assigning classes to the rest. It builds a joint embedding from visual, textual, and positional features of bounding boxes so that standard label propagation can fill in the missing labels. A reader would care because document processing pipelines constantly need updated class labels as new pages arrive and annotation schemes evolve. The approach is presented as a plug-and-play addition to existing object detectors rather than a full retraining pipeline.

Core claim

Bounding Box Label Propagation (BBLP) is a pseudo-labelling framework for object detection that encodes each bounding box with combined visual, textual, and positional embeddings to form a joint space; label propagation is then run directly in that space on partially annotated document-layout datasets, producing class labels whose quality reaches an mAP of 54.0% on D4LA (81.6% of the fully supervised figure) when only 10% of the boxes are initially labelled.
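To make the propagation step concrete, here is a minimal sketch of graph-based label propagation over per-box embeddings. It uses the iteration F ← αSF + (1 − α)Y from Zhou et al. [41] on a cosine k-nearest-neighbour graph. This page does not state which propagation variant or graph construction the paper uses, so the function name `propagate_labels` and the parameters `k`, `alpha`, and `n_iter` are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def propagate_labels(embeddings, labels, n_classes, k=5, alpha=0.9, n_iter=50):
    """Spread class labels over a kNN graph (hypothetical helper, not the paper's code).

    embeddings: (n, d) array, one row per bounding box.
    labels: (n,) int array, class id for labelled boxes and -1 for unlabelled ones.
    Returns an (n,) array of predicted class ids.
    """
    x = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    n = x.shape[0]

    # Cosine similarity, with self-similarity excluded from the neighbour search.
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)
    idx = np.argsort(-sim, axis=1)[:, :k]

    # Sparse-in-spirit affinity matrix: keep only the k strongest edges per row.
    w = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    cols = idx.ravel()
    w[rows, cols] = np.maximum(sim[rows, cols], 0.0)
    w = np.maximum(w, w.T)  # symmetrise the graph

    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(w.sum(axis=1), 1e-12))
    s = w * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # One-hot seed matrix Y for the labelled boxes.
    y0 = np.zeros((n, n_classes))
    labelled = labels >= 0
    y0[labelled, labels[labelled]] = 1.0

    f = y0.copy()
    for _ in range(n_iter):
        f = alpha * (s @ f) + (1.0 - alpha) * y0

    pred = f.argmax(axis=1)
    pred[labelled] = labels[labelled]  # manual labels are kept as given
    return pred
```

The dense n × n similarity matrix keeps the sketch short; a dataset with many boxes would likely need an approximate nearest-neighbour index and a sparse matrix instead.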

What carries the argument

An object encoder that fuses visual, textual, and positional embeddings of each bounding box into a single vector for use in label propagation.
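As a rough illustration of what "fusing" three modalities into one vector can mean, the sketch below normalises each modality and concatenates them. This is a deliberately simple stand-in: the paper's Layout Object Encoder is a trained model, and nothing on this page specifies its layers or its positional features. The helpers `box_position_features` and `fuse_embeddings` and the six positional features are assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def box_position_features(boxes, page_sizes):
    """Hypothetical positional features for boxes given as (x0, y0, x1, y1) in pixels.

    page_sizes holds (width, height) per box; the output is corner coordinates
    plus box width and height, all relative to the page size.
    """
    boxes = np.asarray(boxes, dtype=float)
    page_sizes = np.asarray(page_sizes, dtype=float)
    w, h = page_sizes[:, 0], page_sizes[:, 1]
    x0, y0, x1, y1 = boxes.T
    return np.stack(
        [x0 / w, y0 / h, x1 / w, y1 / h, (x1 - x0) / w, (y1 - y0) / h], axis=1
    )

def fuse_embeddings(visual, textual, positional):
    """Concatenate per-modality unit vectors and renormalise (assumed fusion rule)."""
    parts = [l2_normalize(np.asarray(p, dtype=float)) for p in (visual, textual, positional)]
    return l2_normalize(np.concatenate(parts, axis=1))
```

Normalising each modality first stops a high-dimensional visual embedding from dominating distances purely through its scale. A learned encoder can instead weight the modalities by how informative they are for the class labels.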

If this is right

High-quality class labels can be produced for the great majority of bounding boxes without manual review.
Re-annotation effort for evolving document-layout datasets drops sharply when only a small labelled seed is maintained.
The same encoder-plus-propagation pipeline can be attached to any object detector without changing its training procedure.
Performance at 10% labelled data already recovers more than four-fifths of the accuracy obtained with complete supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same embedding construction could be tried on non-document object-detection tasks where datasets also grow incrementally.
Replacing the standard propagation step with more recent semi-supervised variants might raise the recovered fraction above 81.6%.
Measuring how performance changes when the labelled fraction drops below 10% would show the practical lower limit of the method.

Load-bearing premise

The combined visual-textual-positional embedding space is sufficiently well-structured that ordinary label propagation assigns the correct class to each unlabelled bounding box.

What would settle it

On the D4LA dataset with exactly 10% of boxes labelled, measure the mAP obtained after running the full BBLP pipeline; if the result falls substantially below 54.0% or well under 81.6% of the fully supervised mAP, the claim does not hold.
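The two headline numbers pin down a third: 54.0% mAP at 81.6% recovery implies a fully supervised mAP of roughly 54.0 / 0.816 ≈ 66.2%. That figure is derived here, not quoted from the paper. The sketch below does the arithmetic and turns the falsification test into a check. The source does not define "substantially below", so the `slack` parameter (a 2% relative margin) is an assumption.

```python
def implied_full_supervision_map(semi_map, recovered_fraction):
    """Fully supervised mAP implied by a semi-supervised mAP and its recovery ratio."""
    return semi_map / recovered_fraction

def claim_holds(measured_map, full_map, claimed_map=54.0, claimed_fraction=0.816, slack=0.02):
    """Return True if a replication stays within `slack` of both claimed figures.

    `slack` is a hypothetical relative tolerance; pick one that matches the
    run-to-run variance of the detector being trained.
    """
    fraction = measured_map / full_map
    map_ok = measured_map >= claimed_map * (1.0 - slack)
    fraction_ok = fraction >= claimed_fraction * (1.0 - slack)
    return map_ok and fraction_ok

full_map = implied_full_supervision_map(54.0, 0.816)
```

For example, `claim_holds(54.0, full_map)` passes, while a replication that measures 45.0% mAP against the same baseline fails both conditions.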

Figures

Figures reproduced from arXiv: 2606.17644 by Alexander Lück, Christian Schön, Didier Stricker, Nick Jochum, René Schuster, Tobias Alt-Veit.

**Figure 1.** BBLP framework overview. For all extracted object regions, the proposed multi-modal Layout Object Encoder embeds all layout objects into a common vector space. Label Propagation then exploits this representation and generates pseudo-labels for unlabelled objects by transferring label information from the labelled objects.

**Figure 2.** Overview of the Layout Object Encoder architecture. The encoder maps visual, textual, and positional representations of a layout object to a unified embedding. During training, the harmonization head provides supervision through a surrogate classification task.

**Figure 3.** Training performance of a DINO object detector trained on BBLP pseudo-labels. For each evaluation dataset, curves show mAP scores on the test split during training for models trained on pseudo-labelled, 10%, and 100% ground-truth annotated documents.

Original abstract

Datasets in practical document processing scenarios typically grow over time, and their class annotations undergo continuous refinement. This creates significant re-annotation efforts, which are time-consuming and costly. A promising remedy is to re-annotate only a small subset of available documents manually and apply semi-supervised learning techniques that leverage both labelled and unlabelled data. Although there are numerous approaches to tackle this problem for classification, there exists no adaptation for the problem of re-classifying object detection instances, e.g. for document layout analysis. To this end, we propose Bounding Box Label Propagation (BBLP), a pseudo-labelling framework for object detection. An object encoder integrates visual, textual, and positional embeddings from object detection samples to come up with a joint embedding that can be used for Label Propagation on partially annotated datasets in a plug-and-play fashion. Evaluation results indicate that the proposed approach produces high-quality class annotations of bounding boxes. In the D4LA layout analysis dataset, it achieves a mAP of 54.0%, corresponding to 81.6% of fully supervised performance, while using only 10% labelled data. Our work demonstrates the potential of Label Propagation for object detection and lays the groundwork for reducing manual annotation efforts in real-world document processing applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BBLP adapts label propagation to bounding box re-annotation via a multi-modal encoder, delivering usable numbers on D4LA but with thin validation on whether the embedding actually works.

The core move here is taking label propagation, which is routine for image classification, and extending it to re-label unannotated bounding boxes in document layout datasets. They encode each box with visual, textual, and positional features, build a joint space, and propagate from a small labelled subset. That is the actual novelty claimed, and it matches the abstract's statement that no prior adaptation existed for object-detection re-annotation.

The practical result is the main strength: on D4LA with only 10% of the data labelled they reach 54.0% mAP, or 81.6% of the fully supervised baseline. For teams that maintain growing document collections and want to avoid full re-labelling, that level of recovery is worth noticing.

The soft spot is the missing check on the embedding space itself. Document classes are visually similar, so propagation succeeds only if the joint features cluster by semantic class rather than by position or low-level appearance. The abstract supplies no nearest-neighbour accuracy, component ablations, or separation diagnostics, so the concern that propagation may spread labelling errors still stands. Without those controls it is difficult to tell whether the method is robust or just lucky on this dataset. The experimental description is also light on protocol, baselines, and variance.

This is for document-analysis practitioners who need to stretch annotation budgets. A general semi-supervised detection researcher would probably read the method section once and move on. The work is coherent on its own terms and the numbers are high enough to justify referee time, so I would send it out for review rather than desk-reject.

Referee Report

2 major / 0 minor

Summary. The paper proposes Bounding Box Label Propagation (BBLP), a pseudo-labelling framework for re-annotating object detection instances in document layout analysis datasets. An object encoder combines visual, textual, and positional embeddings into a joint space on which standard label propagation is applied in a plug-and-play manner to partially annotated bounding boxes. On the D4LA dataset the method reports 54.0% mAP (81.6% of fully-supervised performance) using only 10% labelled data.

Significance. If the central result is reproducible and the embedding separation is shown to be the operative mechanism, the work would offer a practical route to lowering re-annotation costs for continuously growing document-layout corpora by adapting label-propagation ideas from classification to detection.

major comments (2)

[Abstract] The headline claim (54.0% mAP = 81.6% of fully supervised with 10% labels) rests on the untested premise that the joint visual+textual+positional embedding produces a metric space in which label propagation reliably recovers correct layout classes rather than propagating errors; no nearest-neighbour accuracy, t-SNE separation, or component ablation is supplied to support this condition for visually confusable document classes.
[Methods / Evaluation] The experimental protocol (selection of the 10% labelled subset, choice of propagation algorithm and graph construction, baseline comparisons, error bars, and validation that the encoder was not trained with class supervision) is absent from the abstract and therefore cannot be assessed; without these details the reported recovery rate cannot be attributed to the proposed embedding.
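The nearest-neighbour diagnostic requested in the first comment is cheap to run on any labelled subset. The sketch below computes leave-one-out k-nearest-neighbour accuracy under cosine similarity. A high score suggests the embedding clusters by class, which is the condition label propagation depends on. The function name `loo_knn_accuracy` and the default `k=5` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def loo_knn_accuracy(embeddings, labels, k=5):
    """Leave-one-out kNN accuracy on labelled embeddings (hypothetical diagnostic).

    embeddings: (n, d) array; labels: (n,) int array with class ids 0..C-1.
    Each point is classified by majority vote among its k nearest neighbours,
    excluding itself, and the fraction of correct votes is returned.
    """
    x = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # a point may not vote for itself
    nn = np.argsort(-sim, axis=1)[:, :k]

    neighbour_labels = y[nn]  # shape (n, k)
    n_classes = int(y.max()) + 1
    votes = np.stack(
        [(neighbour_labels == c).sum(axis=1) for c in range(n_classes)], axis=1
    )
    pred = votes.argmax(axis=1)  # ties resolve to the lowest class id
    return float((pred == y).mean())
```

Running the same function on the visual, textual, and positional embeddings separately would give a first-pass version of the component ablation the referee also asks for.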

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and indicate the revisions that will be made to the manuscript.

Referee: [Abstract] The headline claim (54.0% mAP = 81.6% of fully supervised with 10% labels) rests on the untested premise that the joint visual+textual+positional embedding produces a metric space in which label propagation reliably recovers correct layout classes rather than propagating errors; no nearest-neighbour accuracy, t-SNE separation, or component ablation is supplied to support this condition for visually confusable document classes.

Authors: We agree that the abstract presents the headline result without direct supporting analyses of the embedding space. In the revised manuscript we will add nearest-neighbour accuracy figures, t-SNE visualizations of the joint embedding, and component-wise ablations (visual, textual, positional) to the experimental section. These additions will demonstrate class separation and justify the use of label propagation on the learned metric. Revision: yes.

Referee: [Methods / Evaluation] The experimental protocol (selection of the 10% labelled subset, choice of propagation algorithm and graph construction, baseline comparisons, error bars, and validation that the encoder was not trained with class supervision) is absent from the abstract and therefore cannot be assessed; without these details the reported recovery rate cannot be attributed to the proposed embedding.

Authors: We acknowledge that the abstract omits these protocol details. We will revise the abstract to include a concise description of the 10% subset selection procedure, the label-propagation algorithm and graph construction, the baselines used, and explicit confirmation that the object encoder was trained without class supervision. Error bars will be reported where applicable. The full protocol already appears in Sections 3 and 4; the revision will make these elements visible at the abstract level so that the contribution can be properly evaluated. Revision: yes.

Circularity Check

0 steps flagged

No circularity; plug-and-play framework with empirical evaluation on external dataset

The paper introduces BBLP as a semi-supervised pseudo-labelling framework that combines an object encoder (visual + textual + positional embeddings) with standard label propagation. No equations, fitted parameters, or self-referential definitions appear in the abstract or described method. The headline result (54.0% mAP on D4LA with 10% labels) is an empirical measurement against a fully-supervised baseline, not a quantity derived by construction from the authors' own prior fits or self-citations. The derivation chain is self-contained against external benchmarks and does not reduce to any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, training details, or modelling choices are visible, so free parameters, axioms, and invented entities cannot be enumerated. Full manuscript required for ledger construction.

pith-pipeline@v0.9.1-grok · 5768 in / 1059 out tokens · 36430 ms · 2026-06-27T01:53:04.952976+00:00 · methodology

Reference graph

Works this paper leans on

43 extracted references · 3 canonical work pages · 2 internal anchors

[1] Antonacopoulos, A., Bridson, D., Papadopoulos, C., Pletschacher, S.: A realistic dataset for performance evaluation of document layout analysis. In: Proc. 10th Int. Conf. on Document Analysis and Recognition. pp. 296–300. Barcelona, Spain (Jul 2009)
[2] Banerjee, A., Biswas, S., Lladós, J., Pal, U.: SemiDocSeg: Harnessing semi-supervised learning for document layout analysis. Int. J. on Document Analysis and Recognition 27(3), 317–334 (Jun 2024)
[3] Beyer, L., Izmailov, P., Kolesnikov, A., Caron, M., Kornblith, S., Zhai, X., Minderer, M., Tschannen, M., Alabdulmohsin, I., Pavetic, F.: FlexiViT: One model for all patch sizes. In: Proc. 2023 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. pp. 14496–14506. Vancouver, BC, Canada (Jun 2023)
[4] Binmakhashen, G.M., Mahmoud, S.A.: Document layout analysis: A comprehensive survey. ACM Comput. Surv. 52(6), article no. 109 (Oct 2019)
[5] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision - ECCV 2020, Lecture Notes in Computer Science, vol. 12346, pp. 213–229. Springer, Cham (2020)
[6] Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. The MIT Press (2006)
[7] Da, C., Luo, C., Zheng, Q., Yao, C.: Vision grid transformer for document layout analysis. In: Proc. 2023 IEEE/CVF Int. Conf. on Computer Vision. pp. 19405–19415. Paris, France (Oct 2023)
[8] Dehghani, M., Mustafa, B., Djolonga, J., Heek, J., Minderer, M., Caron, M., Steiner, A., Puigcerver, J., Geirhos, R., Alabdulmohsin, I.M., Oliver, A., Padlewski, P., Gritsenko, A.A., Lucic, M., Houlsby, N.: Patch n’ Pack: NaViT, a vision transformer for any aspect ratio and resolution. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine,...
[9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: Proc. 9th Int. Conf. on Learning Representations. Vienna, Austria (May 2021)
[10] van Engelen, J.E., Hoos, H.H.: A survey on semi-supervised learning. Mach. Learn. 109(2), 373–440 (Nov 2019)
[11] Gemelli, A., Marinai, S., Pisaneschi, L., Santoni, F.: Datasets and annotations for layout analysis of scientific articles. Int. J. on Document Analysis and Recognition 27(4), 683–705 (Nov 2024)
[12] Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proc. 2014 IEEE Conf. on Computer Vision and Pattern Recognition. pp. 580–587. Columbus, OH (Jun 2014)
[13] Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: Proc. 2006 IEEE Conf. on Computer Vision and Pattern Recognition. pp. 1735–1742. New York, NY (Jun 2006)
[14] Huang, Q., He, H., Singh, A., Lim, S.N., Benson, A.R.: Combining label propagation and simple models out-performs graph neural networks. In: Proc. 9th Int. Conf. on Learning Representations. Vienna, Austria (May 2021)
[15] Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: LayoutLMv3: Pre-training for document AI with unified text and image masking. In: Magalhães, J., Bimbo, A.D., Satoh, S., Sebe, N., Alameda-Pineda, X., Jin, Q., Oria, V., Toni, L. (eds.) Proc. 30th ACM Int. Conf. on Multimedia. pp. 4083–4091. Lisbon, Portugal (Oct 2022)
[16] Iscen, A., Tolias, G., Avrithis, Y., Chum, O.: Label propagation for deep semi-supervised learning. In: Proc. 2019 IEEE Conf. on Computer Vision and Pattern Recognition. pp. 5070–5079. Long Beach, CA (Jun 2019)
[17] Kallempudi, G., Hashmi, K.A., Pagani, A., Afzal, M.Z., Stricker, D.: Toward semi-supervised graphical object detection in document images. Future Internet 14(6), article no. 176 (Jun 2022)
[18] Li, G., Li, X., Wang, Y., Wu, Y., Liang, D., Zhang, S.: PseCo: Pseudo labeling and consistency training for semi-supervised object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision - ECCV 2022, Lecture Notes in Computer Science, vol. 13669, pp. 457–472. Springer, Cham (2022)
[19] Li, J., Xu, Y., Lv, T., Cui, L., Zhang, C., Wei, F.: DiT: Self-supervised pre-training for document image transformer. In: Magalhães, J., Bimbo, A.D., Satoh, S., Sebe, N., Alameda-Pineda, X., Jin, Q., Oria, V., Toni, L. (eds.) Proc. 30th ACM Int. Conf. on Multimedia. pp. 3530–3539. Lisbon, Portugal (Oct 2022)
[20] Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., Zhou, M.: DocBank: A benchmark dataset for document layout analysis. In: Scott, D., Bel, N., Zong, C. (eds.) Proc. 28th Int. Conf. on Computational Linguistics. pp. 949–960. Barcelona, Spain (Dec 2020)
[21] Lin, T., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision - ECCV 2014, Lecture Notes in Computer Science, vol. 8693, pp. 740–755. Springer, Cham (2014)
[22] Lin, T.Y., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 42(2), 318–327 (Feb 2020)
[23] Liu, Y.C., Ma, C.Y., He, Z., Kuo, C.W., Chen, K., Zhang, P., Wu, B., Kira, Z., Vajda, P.: Unbiased teacher for semi-supervised object detection. In: Proc. 9th Int. Conf. on Learning Representations. Vienna, Austria (May 2021)
[24] Nassar, A., Livathinos, N., Lysak, M., Staar, P.: TableFormer: Table structure understanding with transformers. In: Proc. 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. pp. 4604–4613. New Orleans, LA (Jun 2022)
[25] Pfitzmann, B., Auer, C., Dolfi, M., Nassar, A.S., Staar, P.: DocLayNet: A large human-annotated dataset for document-layout segmentation. In: Zhang, A., Rangwala, H. (eds.) Proc. 28th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining. pp. 3743–3751. Washington, DC (Aug 2022)
[26] Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proc. 2016 IEEE Conf. on Computer Vision and Pattern Recognition. pp. 779–788. Las Vegas, NV (Jun 2016)
[27] Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Proc. 29th Int. Conf. on Neural Information Processing Systems. Advances in Neural Information Processing Systems, vol. 28, pp. 91–99. Montreal, QC, Canada (Dec 2015)
[28] Rizve, M.N., Duarte, K., Rawat, Y.S., Shah, M.: In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In: Proc. 9th Int. Conf. on Learning Representations. Vienna, Austria (May 2021)
[29] Smith, R.: An overview of the Tesseract OCR engine. In: Proc. 9th Int. Conf. on Document Analysis and Recognition. pp. 629–633. Curitiba, Brazil (Sep 2007)
[30] Sohn, K., Zhang, Z., Li, C.L., Zhang, H., Lee, C.Y., Pfister, T.: A simple semi-supervised learning framework for object detection. arXiv:2005.04757v2 [cs.CV] (2020)
[31] Tang, Y., Chen, W., Luo, Y., Zhang, Y.: Humble teachers teach better students for semi-supervised object detection. In: Proc. 2021 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. pp. 3132–3141. virtual (Jun 2021)
[32] Tang, Z., Yang, Z., Wang, G., Fang, Y., Liu, Y., Zhu, C., Zeng, M., Zhang, C., Bansal, M.: Unifying vision, text, and layout for universal document processing. In: Proc. 2023 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. pp. 19254–19264. Vancouver, BC, Canada (Jun 2023)
[33] Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Proc. 31st Int. Conf. on Neural Information Processing Systems. Advances in Neural Information P...
[34] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B.: SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv:2502.14786 [cs.CV] (Feb 2025)
[35] Wang, J., Hu, K., Huo, Q.: DLAFormer: An end-to-end transformer for document layout analysis. In: Smith, E.H.B., Liwicki, M., Peng, L. (eds.) Document Analysis and Recognition - ICDAR 2024, Lecture Notes in Computer Science, vol. 14807, pp. 40–57. Springer, Cham (2024)
[36] Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., Wei, F.: Multilingual E5 text embeddings: A technical report. arXiv:2402.05672v1 [cs.CL] (Feb 2024)
[37] Xu, M., Zhang, Z., Hu, H., Wang, J., Wang, L., Wei, F., Bai, X., Liu, Z.: End-to-end semi-supervised object detection with soft teacher. In: Proc. 2021 IEEE/CVF Int. Conf. on Computer Vision. pp. 3040–3049. Montreal, Canada (Oct 2021)
[38] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-training of text and layout for document image understanding. In: Proc. 26th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining. pp. 1192–1200. virtual (Aug 2020)
[39] Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.Y.: DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In: Proc. 11th Int. Conf. on Learning Representations. Kigali, Rwanda (May 2023)
[40] Zhong, X., Tang, J., Jimeno-Yepes, A.: PubLayNet: Largest dataset ever for document layout analysis. In: Proc. 2019 Int. Conf. on Document Analysis and Recognition. pp. 1015–1022. Sydney, Australia (Sep 2019)
[41] Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local and global consistency. In: Thrun, S., Saul, L.K., Schölkopf, B. (eds.) Proc. 17th Int. Conf. on Neural Information Processing Systems. Advances in Neural Information Processing Systems, vol. 16, pp. 321–328. Vancouver and Whistler, Canada (Dec 2003)
[42] Zhou, H., Ge, Z., Liu, S., Mao, W., Li, Z., Yu, H., Sun, J.: Dense teacher: Dense pseudo-labels for semi-supervised object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision - ECCV 2022, Lecture Notes in Computer Science, vol. 13669, pp. 35–50. Springer, Cham (2022)
[43] Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation. Technical report (2002)

[1] [1]

In: Proc

Antonacopoulos, A., Bridson, D., Papadopoulos, C., Pletschacher, S.: A realistic dataset for performance evaluation of document layout analysis. In: Proc. 10th Int. Conf. on Document Analysis and Recognition. pp. 296–300. Barcelona, Spain (Jul 2009)

2009

[2] [2]

Banerjee, A., Biswas, S., Lladós, J., Pal, U.: SemiDocSeg: Harnessing semi- supervised learning for document layout analysis. Int. J. on Document Analysis and Recognition27(3), 317–334 (Jun 2024)

2024

[3] [3]

In: Proc

Beyer, L., Izmailov, P., Kolesnikov, A., Caron, M., Kornblith, S., Zhai, X., Minderer, M., Tschannen, M., Alabdulmohsin, I., Pavetic, F.: Flexivit: One model for all patch sizes. In: Proc. 2023 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. pp. 14496–14506. Vancouver, BC, Canada (Jun 2023)

2023

[4] [4]

ACM Comput

Binmakhashen, G.M., Mahmoud, S.A.: Document layout analysis: A comprehensive survey. ACM Comput. Surv.52(6), article no. 109 (Oct 2019)

2019

[5] [5]

In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End- to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision - ECCV 2020, Lecture Notes in Computer Science, vol. 12346, pp. 213–229. Springer, Cham (2020)

2020

[6] [6]

The MIT Press (2006)

Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. The MIT Press (2006)

2006

[7] [7]

In: Proc

Da, C., Luo, C., Zheng, Q., Yao, C.: Vision grid transformer for document layout analysis. In: Proc. 2023 IEEE/CVF Int. Conf. on Computer Vision. pp. 19405–19415. Paris, France (Oct 2023)

2023

[8] [8]

In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S

Dehghani, M., Mustafa, B., Djolonga, J., Heek, J., Minderer, M., Caron, M., Steiner, A., Puigcerver, J., Geirhos, R., Alabdulmohsin, I.M., Oliver, A., Padlewski, P., Gritsenko, A.A., Lucic, M., Houlsby, N.: Patch n’ pack: Navit, a vision transformer for any aspect ratio and resolution. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine,...

2023

[9] [9]

In: Proc

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: Proc. 9th Int. Conf. on Learning Representations. Vienna, Austria (May 2021)

2021

[10] [10]

van Engelen, J.E., Hoos, H.H.: A survey on semi-supervised learning. Mach. Learn. 109(2), 373–440 (Nov 2019)

2019

[11] [11]

Gemelli, A., Marinai, S., Pisaneschi, L., Santoni, F.: Datasets and annotations for layout analysis of scientific articles. Int. J. on Document Analysis and Recognition 27(4), 683–705 (Nov 2024)

2024

[12] [12]

In: Proc

Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proc. 2014 IEEE Conf. on Computer Vision and Pattern Recognition. pp. 580–587. Columbus, OH (Jun 2014)

2014

[13] [13]

In: Proc

Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: Proc. 2006 IEEE Conf. on Computer Vision and Pattern Recognition. pp. 1735–1742. New York, NY (Jun 2006)

2006

[14] [14]

In: Proc

Huang, Q., He, H., Singh, A., Lim, S.N., Benson, A.R.: Combining label propagation and simple models out-performs graph neural networks. In: Proc. 9th Int. Conf. on Learning Representations. Vienna, Austria (May 2021) 16 N. Jochum et al

2021

[15] [15]

In: Magalhães, J., Bimbo, A.D., Satoh, S., Sebe, N., Alameda-Pineda, X., Jin, Q., Oria, V., Toni, L

Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: LayoutLMv3: Pre-training for document AI with unified text and image masking. In: Magalhães, J., Bimbo, A.D., Satoh, S., Sebe, N., Alameda-Pineda, X., Jin, Q., Oria, V., Toni, L. (eds.) Proc. 30th ACM Int. Conf. on Multimedia. pp. 4083–4091. Lisbon, Portugal (Oct 2022)

2022

[16] [16]

In: Proc

Iscen, A., Tolias, G., Avrithis, Y., Chum, O.: Label propagation for deep semi- supervised learning. In: Proc. 2019 IEEE Conf. on Computer Vision and Pattern Recognition. pp. 5070–5079. Long Beach, CA (Jun 2019)

2019

[17] [17]

Future Internet14(6), article no

Kallempudi, G., Hashmi, K.A., Pagani, A., Afzal, M.Z., Stricker, D.: Toward semi- supervised graphical object detection in document images. Future Internet14(6), article no. 176 (june 2022)

2022

[18] [18]

In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T

Li, G., Li, X., Wang, Y., Wu, Y., Liang, D., Zhang, S.: PseCo: Pseudo labeling and consistency training for semi-supervised object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision - ECCV 2022, Lecture Notes in Computer Science, vol. 13669, pp. 457–472. Springer, Cham (2022)

2022

[19] [19]

In: Magalhães, J., Bimbo, A.D., Satoh, S., Sebe, N., Alameda-Pineda, X., Jin, Q., Oria, V., Toni, L

Li, J., Xu, Y., Lv, T., Cui, L., Zhang, C., Wei, F.: DiT: Self-supervised pre-training for document image transformer. In: Magalhães, J., Bimbo, A.D., Satoh, S., Sebe, N., Alameda-Pineda, X., Jin, Q., Oria, V., Toni, L. (eds.) Proc. 30th ACM Int. Conf. on Multimedia. pp. 3530–3539. Lisbon, Portugal (Oct 2022)

2022

[20] Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., Zhou, M.: DocBank: A benchmark dataset for document layout analysis. In: Scott, D., Bel, N., Zong, C. (eds.) Proc. 28th Int. Conf. on Computational Linguistics. pp. 949–960. Barcelona, Spain (Dec 2020)

[21] Lin, T., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision - ECCV 2014, Lecture Notes in Computer Science, vol. 8693, pp. 740–755. Springer, Cham (2014)

[22] Lin, T.Y., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 42(2), 318–327 (Feb 2020)

[23] Liu, Y.C., Ma, C.Y., He, Z., Kuo, C.W., Chen, K., Zhang, P., Wu, B., Kira, Z., Vajda, P.: Unbiased teacher for semi-supervised object detection. In: Proc. 9th Int. Conf. on Learning Representations. Vienna, Austria (May 2021)

[24] Nassar, A., Livathinos, N., Lysak, M., Staar, P.: TableFormer: Table structure understanding with transformers. In: Proc. 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. pp. 4604–4613. New Orleans, LA (Jun 2022)

[25] Pfitzmann, B., Auer, C., Dolfi, M., Nassar, A.S., Staar, P.: DocLayNet: A large human-annotated dataset for document-layout segmentation. In: Zhang, A., Rangwala, H. (eds.) Proc. 28th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining. pp. 3743–3751. Washington, DC (Aug 2022)

[26] Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proc. 2016 IEEE Conf. on Computer Vision and Pattern Recognition. pp. 779–788. Las Vegas, NV (Jun 2016)

[27] Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Proc. 29th Int. Conf. on Neural Information Processing Systems. Advances in Neural Information Processing Systems, vol. 28, pp. 91–99. Montreal, QC, Canada (Dec 2015)

[28] Rizve, M.N., Duarte, K., Rawat, Y.S., Shah, M.: In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In: Proc. 9th Int. Conf. on Learning Representations. Vienna, Austria (May 2021)

[29] Smith, R.: An overview of the Tesseract OCR engine. In: Proc. 9th Int. Conf. on Document Analysis and Recognition. pp. 629–633. Curitiba, Brazil (Sep 2007)

[30] Sohn, K., Zhang, Z., Li, C.L., Zhang, H., Lee, C.Y., Pfister, T.: A simple semi-supervised learning framework for object detection. arXiv:2005.04757v2 [cs.CV] (2020)

[31] Tang, Y., Chen, W., Luo, Y., Zhang, Y.: Humble teachers teach better students for semi-supervised object detection. In: Proc. 2021 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. pp. 3132–3141. virtual (Jun 2021)

[32] Tang, Z., Yang, Z., Wang, G., Fang, Y., Liu, Y., Zhu, C., Zeng, M., Zhang, C., Bansal, M.: Unifying vision, text, and layout for universal document processing. In: Proc. 2023 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. pp. 19254–19264. Vancouver, BC, Canada (Jun 2023)

[33] Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Proc. 31st Int. Conf. on Neural Information Processing Systems. Advances in Neural Information P... (2017)

[34] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B.: SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv:2502.14786 [cs.CV] (Feb 2025)

[35] Wang, J., Hu, K., Huo, Q.: DLAFormer: An end-to-end transformer for document layout analysis. In: Smith, E.H.B., Liwicki, M., Peng, L. (eds.) Document Analysis and Recognition - ICDAR 2024, Lecture Notes in Computer Science, vol. 14807, pp. 40–57. Springer, Cham (2024)

[36] Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., Wei, F.: Multilingual E5 text embeddings: A technical report. arXiv:2402.05672v1 [cs.CL] (Feb 2024)

[37] Xu, M., Zhang, Z., Hu, H., Wang, J., Wang, L., Wei, F., Bai, X., Liu, Z.: End-to-end semi-supervised object detection with soft teacher. In: Proc. 2021 IEEE/CVF Int. Conf. on Computer Vision. pp. 3040–3049. Montreal, Canada (Oct 2021)

[38] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-training of text and layout for document image understanding. In: Proc. 26th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining. pp. 1192–1200. virtual (Aug 2020)

[39] Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.Y.: DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In: Proc. 11th Int. Conf. on Learning Representations. Kigali, Rwanda (May 2023)

[40] Zhong, X., Tang, J., Jimeno-Yepes, A.: PubLayNet: Largest dataset ever for document layout analysis. In: Proc. 2019 Int. Conf. on Document Analysis and Recognition. pp. 1015–1022. Sydney, Australia (Sep 2019)

[41] Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local and global consistency. In: Thrun, S., Saul, L.K., Schölkopf, B. (eds.) Proc. 17th Int. Conf. on Neural Information Processing Systems. Advances in Neural Information Processing Systems, vol. 16, pp. 321–328. Vancouver and Whistler, Canada (Dec 2003)

[42] Zhou, H., Ge, Z., Liu, S., Mao, W., Li, Z., Yu, H., Sun, J.: Dense teacher: Dense pseudo-labels for semi-supervised object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision - ECCV 2022, Lecture Notes in Computer Science, vol. 13669, pp. 35–50. Springer, Cham (2022)

[43] Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation. Technical report (2002)