arxiv: 1907.07023 · v1 · pith:Y5DX47CCnew · submitted 2019-07-16 · 💻 cs.CV · cs.LG

Data Selection for training Semantic Segmentation CNNs with cross-dataset weak supervision

Pith reviewed 2026-05-24 20:53 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords semantic segmentation · weak supervision · data selection · Gaussian mixture model · object diversity · Cityscapes · Open Images · automated driving

The pith

Selecting subsets of weakly labeled images lets semantic segmentation networks match full-set accuracy with up to 100 times less data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces two selection methods to identify the most useful images that carry only bounding-box labels when training per-pixel semantic segmentation networks. The first models visual appearance of images through a Gaussian mixture to locate similar examples without using any labels. The second counts distinct objects inside the boxes to favor scenes with high variety. Tests on Cityscapes driving scenes and Open Images show that networks trained on the chosen small subsets reach the same accuracy as those trained on the entire weak collection. This approach matters because pixel-level labels are costly to obtain, so trimming the weak data volume lowers the overall supervision burden while preserving performance.

Core claim

Modeling image representations with a Gaussian Mixture Model finds visually similar images, while counting object instances from bounding boxes finds diverse images; both criteria select small subsets of weakly labeled data that train semantic segmentation CNNs to the same accuracy level as the full sets, enabling reductions of up to 100 times on Open Images and 20 times on Cityscapes.
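
The similarity-based selection can be illustrated with a minimal sketch. It assumes, hypothetically, that each image has already been reduced to a fixed-length feature vector, that the mixture is fitted on features from the target domain, and that candidates from the weakly labeled pool are ranked by their log-likelihood under that mixture. The function name `select_similar`, the number of components, the diagonal covariance, and the toy data are illustrative choices, not details taken from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_similar(reference_feats, candidate_feats, n_select, n_components=4, seed=0):
    """Rank candidates by log-likelihood under a GMM fitted to reference features."""
    gmm = GaussianMixture(
        n_components=n_components, covariance_type="diag", random_state=seed
    )
    gmm.fit(reference_feats)                        # EM fit; no labels are used
    log_prob = gmm.score_samples(candidate_feats)   # one log-likelihood per candidate
    order = np.argsort(-log_prob)                   # most similar first
    return order[:n_select], log_prob


# Toy data standing in for image embeddings (hypothetical, 8-dimensional).
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=(500, 8))   # target-domain features
near = rng.normal(loc=0.0, scale=1.0, size=(50, 8))         # candidates resembling the target
far = rng.normal(loc=6.0, scale=1.0, size=(950, 8))         # candidates unlike the target
candidates = np.vstack([near, far])

selected, scores = select_similar(reference, candidates, n_select=50)
```

In this toy setup the 50 selected indices are the candidates drawn from the same distribution as the reference set, a 20-fold reduction of the pool. Real image features would not separate this cleanly.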

What carries the argument

Gaussian Mixture Model fitted to image feature representations for similarity-based selection, together with object-count diversity measured from bounding boxes; these act as filters that reduce the weak training set before the segmentation network is trained.
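
The object-diversity filter can be sketched in a few lines. This is one possible heuristic under stated assumptions: an image is scored first by how many distinct classes of interest appear among its bounding boxes, then by how many boxes of interest it contains. The class list, the tie-breaking rule, and the names `diversity_score` and `select_diverse` are hypothetical; the paper's exact scoring heuristic may differ.

```python
from collections import Counter

# Hypothetical set of street-scene classes of interest.
CLASSES_OF_INTEREST = {"car", "person", "bicycle", "traffic sign", "bus"}


def diversity_score(box_labels):
    """Return (distinct classes of interest, boxes of interest) for one image."""
    counts = Counter(label for label in box_labels if label in CLASSES_OF_INTEREST)
    return len(counts), sum(counts.values())


def select_diverse(annotations, n_select):
    """Pick the n_select images with the highest diversity score.

    annotations maps an image id to the list of class labels of its boxes.
    """
    ranked = sorted(
        annotations,
        key=lambda image_id: diversity_score(annotations[image_id]),
        reverse=True,
    )
    return ranked[:n_select]


annotations = {
    "img_a": ["car", "car", "car"],
    "img_b": ["car", "person", "bicycle", "tree"],
    "img_c": ["dog", "cat"],
    "img_d": ["person", "bus"],
}
top = select_diverse(annotations, n_select=2)
```

Here `img_b` (three distinct classes of interest) ranks ahead of `img_d` (two), while `img_a` has many boxes but only one class and `img_c` has none of interest.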

If this is right

  • The GMM method requires no labels at all, only raw image features.
  • The diversity method needs only the bounding-box annotations already present.
  • Accuracy stays level even after cutting the weak data volume by the reported factors on both datasets.
  • GMM fitting also yields direct descriptions of the underlying image distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two selection rules could be applied together to form even smaller yet still sufficient subsets.
  • The same filtering logic might transfer to other tasks that rely on bounding-box weak labels, such as object detection.
  • Lower data volume would also cut the compute time and memory needed for each training run.

Load-bearing premise

The chosen small subsets still hold enough variety for the network to learn the same pixel-level class distinctions that the full weak collection would provide.

What would settle it

Train identical segmentation networks on the selected reduced sets versus the full weak sets and check whether mean intersection-over-union on a fixed test set drops below the full-set result.
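
The comparison itself reduces to computing mean intersection-over-union for both models on the same test labels. A minimal sketch, assuming integer class maps and toy predictions in place of real network outputs (the arrays and variable names are illustrative):

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Pixel-level confusion matrix (rows: ground truth, columns: prediction)."""
    idx = n_classes * y_true.ravel() + y_pred.ravel()
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)


def mean_iou(conf):
    """Mean IoU over classes that occur in the ground truth or the prediction."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - np.diag(conf)
    valid = union > 0
    return float((tp[valid] / union[valid]).mean())


# Toy 2x3 label maps with three classes.
y_true = np.array([[0, 0, 1], [1, 2, 2]])
pred_full = np.array([[0, 0, 1], [1, 2, 2]])     # stand-in: model trained on the full weak set
pred_subset = np.array([[0, 1, 1], [1, 2, 2]])   # stand-in: model trained on the selected subset

miou_full = mean_iou(confusion_matrix(y_true, pred_full, 3))
miou_subset = mean_iou(confusion_matrix(y_true, pred_subset, 3))
drop = miou_full - miou_subset
```

A `drop` that stays within run-to-run variation would support the claim; a consistent positive `drop` would count against it.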

Figures

Figures reproduced from arXiv: 1907.07023 by Gijs Dubbelman, Panagiotis Meletis, Rob Romijnders.

Figure 1: The proposed selection methods aim at selecting …
Figure 2: Example of selected images from N = 1.74 million Open Images images using our data selection methods in descending order. First row: visual similarity using GMM, the simcitys measure is shown. Second row: object diversity using class scores, the heuristics scores and the number of objects of interest are shown.
Figure 3: Performance (mIoU) on Cityscapes validation set. The networks are trained on Cityscapes Dense and optionally on additional selected data from Cityscapes Coarse and Open Images. The dots mark the conducted experiments. The black horizontal line denotes the mIoU of training without weak supervision.
Figure 5: Empirical histogram of the log probabilities for the …
Figure 4: tSNE plot for the image representations for a sample …
Original abstract

Training convolutional networks for semantic segmentation with strong (per-pixel) and weak (per-bounding-box) supervision requires a large amount of weakly labeled data. We propose two methods for selecting the most relevant data with weak supervision. The first method is designed for finding visually similar images without the need of labels and is based on modeling image representations with a Gaussian Mixture Model (GMM). As a byproduct of GMM modeling, we present useful insights on characterizing the data generating distribution. The second method aims at finding images with high object diversity and requires only the bounding box labels. Both methods are developed in the context of automated driving and experimentation is conducted on Cityscapes and Open Images datasets. We demonstrate performance gains by reducing the amount of employed weakly labeled images up to 100 times for Open Images and up to 20 times for Cityscapes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes two methods for selecting subsets of weakly labeled (bounding-box) images to train semantic segmentation CNNs: (1) GMM modeling of global image representations to identify visually similar images without using labels, and (2) a bounding-box-based selection for images with high object diversity. Experiments are performed in the automated-driving setting on Cityscapes and Open Images; the central claim is that these selections yield performance gains while reducing the weakly labeled training data by up to 20× (Cityscapes) and 100× (Open Images).

Significance. If the experimental results demonstrate that the reduced subsets maintain segmentation accuracy comparable to the full weak-supervision set, the work would be significant for reducing annotation and compute costs in large-scale semantic segmentation. The GMM byproduct insights on characterizing the data-generating distribution could also be useful for dataset analysis.

major comments (1)
  1. [Abstract and Methods] The claim that GMM-selected subsets (and diversity-selected ones) allow a segmentation CNN to reach performance comparable to the full weak set is load-bearing, yet the method operates solely on global image embeddings. Nothing in the selection guarantees preservation of semantic class frequencies or spatial contexts; in automated-driving data, global features often correlate with scene style or illumination rather than object-class presence. If rare classes (e.g., traffic signs, cyclists) are under-represented, reported gains cannot be attributed to the selection preserving information content.
minor comments (1)
  1. [Abstract] The abstract states performance gains but supplies no quantitative numbers, baselines, error bars, or dataset splits, making it impossible to verify whether the claimed reductions actually preserve accuracy.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to incorporate additional analysis as outlined.

  1. Referee: [Abstract and Methods] The claim that GMM-selected subsets (and diversity-selected ones) allow a segmentation CNN to reach performance comparable to the full weak set is load-bearing, yet the method operates solely on global image embeddings. Nothing in the selection guarantees preservation of semantic class frequencies or spatial contexts; in automated-driving data, global features often correlate with scene style or illumination rather than object-class presence. If rare classes (e.g., traffic signs, cyclists) are under-represented, reported gains cannot be attributed to the selection preserving information content.

    Authors: We agree that the GMM-based selection using global image embeddings provides no explicit guarantee of preserving semantic class frequencies or spatial contexts, and that global features in driving scenes may correlate more with style or illumination than with object presence. This is a substantive methodological limitation. Our defense rests on the empirical results: the selected subsets achieve segmentation performance comparable to the full weak-supervision set despite the large reductions (20× on Cityscapes, 100× on Open Images). These outcomes indicate that the visual similarity modeled by the GMM selects sufficiently informative images in practice for this task and these datasets. To directly address the concern, we will add an analysis of per-class frequencies (including rare classes such as traffic signs and cyclists) in the GMM-selected and diversity-selected subsets versus the full sets, to be included in the revised manuscript.

    Revision: yes
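
A per-class frequency comparison of the kind described in this response could look like the sketch below. It assumes bounding-box class labels are available per image and reports, for each class, the ratio of its relative frequency in the selected subset to its relative frequency in the full set. The function names and toy annotations are hypothetical and are not taken from the authors' code.

```python
from collections import Counter


def class_frequencies(annotations, image_ids=None):
    """Relative frequency of each box class over the given images (all images if None)."""
    ids = annotations.keys() if image_ids is None else image_ids
    counts = Counter(label for image_id in ids for label in annotations[image_id])
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def frequency_ratio(annotations, subset_ids):
    """Subset frequency divided by full-set frequency, per class; 0.0 means the class vanished."""
    full = class_frequencies(annotations)
    subset = class_frequencies(annotations, subset_ids)
    return {label: subset.get(label, 0.0) / freq for label, freq in full.items()}


annotations = {
    "img_a": ["car", "car", "person"],
    "img_b": ["car", "traffic sign"],
    "img_c": ["car", "car", "car"],
    "img_d": ["cyclist", "car"],
}
ratios = frequency_ratio(annotations, ["img_a", "img_c"])
```

In this toy case the ratios for `traffic sign` and `cyclist` are 0.0, which is the kind of rare-class loss the referee asks the authors to rule out.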

Circularity Check

0 steps flagged

No circularity detected; empirical methods with no derivations

The paper describes two empirical data selection procedures (GMM modeling of image representations and bounding-box diversity counting) and reports experimental performance gains on Cityscapes and Open Images. No equations, derivations, predictions, or first-principles results are present in the provided text. Claims rest on standard statistical tools applied to external data rather than any self-definitional reduction, fitted-input renaming, or load-bearing self-citation chain. The work is therefore self-contained against external benchmarks with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review; the abstract states no explicit free parameters, axioms, or invented entities. The two axioms listed below are implicit domain assumptions: for example, the GMM modeling implicitly assumes that visual similarity in feature space correlates with utility for segmentation training.

axioms (2)
  • domain assumption: Image representations modeled by GMM capture visual similarity relevant to semantic segmentation performance.
    Invoked by the first selection method; no justification supplied in abstract.
  • domain assumption: Higher object diversity (measured by bounding boxes) improves training data quality for segmentation.
    Invoked by the second selection method.
