pith. sign in

arxiv: 1907.07156 · v1 · pith:O2BUURFMnew · submitted 2019-07-16 · 💻 cs.CV · cs.LG

Efficient Segmentation: Learning Downsampling Near Semantic Boundaries

Pith reviewed 2026-05-24 20:44 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords semantic segmentationadaptive downsamplingcontent-adaptive samplingboundary preservationefficient inferencesmall object segmentationcomputer vision
0
0 comments X

The pith

A learned downsampling strategy prioritizes locations near semantic boundaries to improve segmentation accuracy and efficiency over uniform sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training a network to decide which image locations to keep when downsampling the input for semantic segmentation, with a bias toward positions close to class boundaries. Uniform downsampling often loses small objects and blurs edges, so the adaptive approach aims to preserve detail where it matters most while still reducing pixel count and compute. The authors report that this yields segmentation maps with sharper boundaries and better coverage of small objects, together with a superior accuracy-to-cost curve on standard benchmarks. A reader would care because many real-time systems must run segmentation quickly yet cannot afford to miss fine details or tiny instances. The central mechanism is therefore the learned sampler that replaces fixed-grid downsampling.

Core claim

The paper claims that a content-adaptive downsampling module, trained to favor sampling locations near semantic boundaries, produces segmentation output whose boundary quality and small-object support exceed those of uniform downsampling at the same computational budget.

What carries the argument

A learned content-adaptive downsampling module that predicts per-location sampling decisions biased toward semantic boundaries.

If this is right

  • Boundary pixels in the output segmentation receive higher fidelity because more samples are allocated there.
  • Small objects receive more reliable support because the sampler avoids skipping them.
  • The accuracy-efficiency frontier improves, so a target accuracy can be reached with fewer operations.
  • The same input resolution can deliver higher overall segmentation quality without increasing compute.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sampler could be inserted before other dense-prediction heads such as depth or instance segmentation.
  • In video streams the per-frame sampling decisions might be regularized across time to reduce flickering.
  • Hardware implementations could cache the learned sampling mask for repeated use on similar scenes.

Load-bearing premise

That the added sampling network can be trained stably and run at inference time without overhead that erases the intended efficiency gains.

What would settle it

A controlled experiment on a public segmentation dataset that measures mean IoU and FLOPs for the adaptive sampler versus uniform downsampling at multiple target resolutions; if the adaptive curve lies strictly above the uniform curve, the claim holds.

Figures

Figures reproduced from arXiv: 1907.07156 by Dmitrii Marin, Fei Yang, Peter Vajda, Priyam Chatterjee, Sam Tsai, Yuri Boykov, Zijian He.

Figure 1
Figure 1. Figure 1: Illustration of our content-adaptive downsampling [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Proposed efficient segmentation architecture with [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Architecture of non-uniform downsampling block [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Double U-Net model for predicting sampling parameters. The depth of the first sub-network can vary (depending [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Examples from Cityscapes [10] val set. (a): orig￾inal images and non-uniform 8 × 8 sampling tensor pro￾duced by our trained auxiliary net in [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Cost-performance analysis on ApolloScape dataset. Proposed method performs better than the baseline method. With the same cost we can achieve higher quality. PSP-Net backbone 0.25 0.35 0.45 0.55 0 10 20 30 40 50 mIoU cost, ·109 flops Proposed Baseline Deeplabv3+ backbone 0.50 0.55 0.60 0.65 15 20 25 30 35 mIoU cost, ·10⁹ flops Proposed Baseline [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗
Figure 11
Figure 11. Figure 11: Cost-performance analysis on Supervisely dataset. Our approach improves quality of segmentation. contrast, brightness and adding salt-and-pepper noise. 4.2. Cost-performance Analysis ApolloScape [27] is an open dataset for autonomous driv￾ing. The dataset consists of approximately 105K training and 8K validation images of size 3384×2710. The anno￾tations contain 22 classes for evaluation. The annotations … view at source ↗
Figure 9
Figure 9. Figure 9: Cost-performance analysis on CityScapes with PSP-Net and Deeplabv3+ baselines for varying downsam￾pling size, see Tab. 3. Our content-adaptive downsampling gives better results with the same computational cost. we set λ = 1. The auxiliary network predicts a sampling tensor of size (2, 8, 8), which is then resized to a required downsampling resolution. During training of the segmenta￾tion network we do not … view at source ↗
Figure 12
Figure 12. Figure 12: Average recall of objects broken down by object classes and sizes on the validation set of ApolloScapes. Values [PITH_FULL_IMAGE:figures/full_fig_p008_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Absolute accuracy difference between our ap [PITH_FULL_IMAGE:figures/full_fig_p008_13.png] view at source ↗
Figure 16
Figure 16. Figure 16: Illustration for the piece-wise linear approxima [PITH_FULL_IMAGE:figures/full_fig_p009_16.png] view at source ↗
read the original abstract

Many automated processes such as auto-piloting rely on a good semantic segmentation as a critical component. To speed up performance, it is common to downsample the input frame. However, this comes at the cost of missed small objects and reduced accuracy at semantic boundaries. To address this problem, we propose a new content-adaptive downsampling technique that learns to favor sampling locations near semantic boundaries of target classes. Cost-performance analysis shows that our method consistently outperforms the uniform sampling improving balance between accuracy and computational efficiency. Our adaptive sampling gives segmentation with better quality of boundaries and more reliable support for smaller-size objects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a content-adaptive downsampling technique for semantic segmentation that learns to favor sampling locations near semantic boundaries of target classes. It claims that this approach consistently outperforms uniform sampling in cost-performance trade-offs, yielding improved boundary quality and more reliable support for smaller objects.

Significance. If the empirical results hold with appropriate controls, the method could improve real-time segmentation pipelines in applications such as autonomous driving by reducing compute while preserving accuracy at semantically important locations.

major comments (1)
  1. [Abstract] Abstract: the claim that the method 'consistently outperforms the uniform sampling' is presented without any experimental details, baselines, error bars, training procedure, or quantitative results, so the central empirical claim cannot be evaluated.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the review and the opportunity to address this point. We respond to the major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the method 'consistently outperforms the uniform sampling' is presented without any experimental details, baselines, error bars, training procedure, or quantitative results, so the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract, by design, is a concise high-level summary and therefore omits the full experimental protocol, specific baselines, error bars, and quantitative tables. These details appear in Sections 4 (Experimental Setup) and 5 (Results), which include direct comparisons against uniform downsampling, multiple backbone architectures, Cityscapes and other datasets, mIoU and boundary F-score metrics, and training hyperparameters. The claim of consistent outperformance is derived from the cost-accuracy curves and per-class analyses shown in those sections. To make the abstract self-contained for readers who may only read the summary, we will revise it to include one or two key quantitative statements (e.g., typical mIoU gains at equivalent FLOPs) while remaining within length limits. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical method for content-adaptive downsampling in semantic segmentation, with claims resting on experimental cost-performance comparisons to uniform sampling. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content. The central claims are externally verifiable through accuracy-efficiency trade-offs and boundary quality metrics, making the work self-contained without reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no mathematical formulation, so no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5636 in / 918 out tokens · 19053 ms · 2026-05-24T20:44:58.290614+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 8 internal anchors

  1. [1]

    https://caffe2.ai

    Caffe2: A New Lightweight, Modular, and Scalable Deep Learning Framework. https://caffe2.ai. 5

  2. [2]

    https://docs.scipy.org/ doc/scipy/reference/generated/scipy

    SciPy is open-source software for mathematics, science, and engineering. https://docs.scipy.org/ doc/scipy/reference/generated/scipy. interpolate.griddata.html. 4

  3. [3]

    Agarwal, A

    S. Agarwal, A. Awan, and D. Roth. Learning to detect ob- jects in images via a sparse, part-based representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 26(11):1475–1490, 2004. 5

  4. [4]

    Andrienko

    O. Andrienko. ICNet and PSPNet-50 in Tensorflow for real- time semantic segmentation. https://github.com/ oandrienko/fast-semantic-segmentation,

  5. [5]

    Badrinarayanan, A

    V . Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for im- age segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(12):2481–2495, Dec

  6. [6]

    L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully con- nected crfs. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence (TPAMI), 40(4):834–848, 2018. 2

  7. [7]

    L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Re- thinking atrous convolution for semantic image segmenta- tion. arXiv preprint arXiv:1706.05587, 2017. 2

  8. [8]

    L.-C. Chen, Y . Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for se- mantic image segmentation. pages 801–818, 2018. 2, 4, 5, 7

  9. [9]

    F. Chollet. Xception: Deep learning with depthwise sepa- rable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) , pages 1251–1258, 2017. 7

  10. [10]

    Cordts, M

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 5, 7

  11. [11]

    J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017. 3

  12. [12]

    Delaunay et al

    B. Delaunay et al. Sur la sphere vide. Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk , 7(793- 800):1–2, 1934. 4

  13. [13]

    Everingham, L

    M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal- network.org/challenges/VOC/voc2012/workshop/index.html. 5

  14. [14]

    Everingham, A

    M. Everingham, A. Zisserman, C. K. Williams, L. Van Gool, M. Allan, C. M. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dork´o, et al. The 2005 pascal visual object classes chal- lenge. In Machine Learning Challenges. Evaluating Predic- tive Uncertainty, Visual Object Classification, and Recognis- ing Tectual Entailment, pages 117–176. Springer, 2006. 5

  15. [15]

    Fei-Fei, R

    L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories.Computer vision and Image understanding, 106(1):59–70, 2007. 5

  16. [16]

    Figurnov, M

    M. Figurnov, M. D. Collins, Y . Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov. Spatially adaptive com- putation time for residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 1039–1048, 2017. 3

  17. [17]

    Figurnov, A

    M. Figurnov, A. Ibraimova, D. P. Vetrov, and P. Kohli. Per- foratedcnns: Acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pages 947–955, 2016. 3

  18. [18]

    Girshick

    R. Girshick. Fast r-cnn. In The IEEE International Confer- ence on Computer Vision (ICCV), December 2015. 2, 3

  19. [19]

    Girshick, J

    R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea- ture hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014. 2, 3

  20. [20]

    Hariharan, P

    B. Hariharan, P. Arbel ´aez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV). IEEE, 2011. 5

  21. [21]

    K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick. Mask r- cnn. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017. 2, 3

  22. [22]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE confer- ence on computer vision and pattern recognition (CVPR) , pages 770–778, 2016. 2, 7

  23. [23]

    X. He, R. S. Zemel, and M. ´A. Carreira-Perpi˜n´an. Multiscale conditional random fields for image labeling. InProceedings of the 2004 IEEE computer society conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages II– II. IEEE, 2004. 2

  24. [24]

    Hernndez-Mederos and J

    V . Hernndez-Mederos and J. Estrada-Sarlabous. Sampling points on regular parametric curves with control of their dis- tribution. Computer Aided Geometric Design , 20(6):363 – 382, 2003. 3

  25. [25]

    Holschneider, R

    M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. Springer, 1990. 2

  26. [26]

    A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Effi- cient convolutional neural networks for mobile vision appli- cations. arXiv preprint arXiv:1704.04861, 2017. 1

  27. [27]

    The ApolloScape Open Dataset for Autonomous Driving and its Application

    X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y . Lin, and R. Yang. The apolloscape dataset for autonomous driving. arXiv preprint arXiv:1803.06184, 2018. 5, 6

  28. [28]

    Jaderberg, K

    M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015. 3

  29. [29]

    Jeon and J

    Y . Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1846–1854. IEEE, 2017. 3

  30. [30]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5

  31. [31]

    Kohli, P

    P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. International Journal of Com- puter Vision (IJCV), 82(3):302–324, 2009. 8

  32. [32]

    Krizhevsky, I

    A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems , pages 1097–1105, 2012. 2

  33. [33]

    LeCun, Y

    Y . LeCun, Y . Bengio, et al. Convolutional networks for im- ages, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995. 2

  34. [34]

    X. Li, Z. Liu, P. Luo, C. Change Loy, and X. Tang. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) , pages 3193–3202, 2017. 3

  35. [35]

    T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra- manan, P. Doll´ar, and C. L. Zitnick. Microsoft coco: Com- mon objects in context. In European Conference on Com- puter Vision (ECCV), pages 740–755. Springer, 2014. 5 10

  36. [36]

    J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 3431–3440, 2015. 2

  37. [37]

    Marin, M

    D. Marin, M. Tang, I. B. Ayed, and Y . Boykov. Beyond gradient descent for regularized segmentation losses. In IEEE conference on Computer Vision and Pattern Recogni- tion (CVPR), 2019. 4

  38. [38]

    Newell, K

    A. Newell, K. Yang, and J. Deng. Stacked hourglass net- works for human pose estimation. In European Confer- ence on Computer Vision (ECCV), pages 483–499. Springer,

  39. [39]

    H. Noh, S. Hong, and B. Han. Learning deconvolution net- work for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1520–1528, 2015. 2

  40. [40]

    S. M. Obeidat and S. Raman. An intelligent sampling method for inspecting free-form surfaces. The International Journal of Advanced Manufacturing Technology , 40(11- 12):1125–1136, 2009. 3

  41. [41]

    Peter, S

    P. Peter, S. Hoffmann, F. Nedwed, L. Hoeltgen, and J. We- ickert. From optimised inpainting with linear pdes towards competitive image compression codecs. In T. Br¨aunl, B. Mc- Cane, M. Rivera, and X. Yu, editors, Image and Video Tech- nology, pages 63–74, Cham, 2016. Springer International Publishing. 3, 9

  42. [42]

    Recasens, P

    A. Recasens, P. Kellnhofer, S. Stent, W. Matusik, and A. Tor- ralba. Learning to zoom: a saliency-based sampling layer for neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 51–66, 2018. 3

  43. [43]

    S. R. Richter, Z. Hayder, and V . Koltun. Playing for bench- marks. In IEEE International Conference on Computer Vi- sion, ICCV 2017, Venice, Italy, October 22-29, 2017 , pages 2232–2241, 2017. 5

  44. [44]

    Ronneberger, P

    O. Ronneberger, P. Fischer, and T. Brox. U-net: Convo- lutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention , pages 234–241. Springer,

  45. [45]

    G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA Dataset: A Large Collection of Syn- thetic Images for Semantic Segmentation of Urban Scenes (SYNTHIA-Rand). In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. 5, 7

  46. [46]

    Russell, P

    C. Russell, P. Kohli, P. H. Torr, et al. Associative hierarchi- cal crfs for object class image segmentation. In Computer Vision, 2009 IEEE 12th International Conference on , pages 739–746. IEEE, 2009. 2

  47. [47]

    Shirley, M

    P. Shirley, M. Ashikhmin, and S. Marschner. Fundamentals of Computer Graphics. A. K. Peters, Ltd., Natick, MA, USA, 2nd edition, 2005. 4

  48. [48]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 2

  49. [49]

    Supervisely Person

    Supervise.ly. Releasing “Supervisely Person” dataset for teaching machines to segment humans. https://hackernoon.com/releasing-supervisely-person- dataset-for-teaching-machines-to-segment-humans- 1f1fc1f28469, 2018. 5, 7

  50. [50]

    M. Tang, F. Perazzi, A. Djelouah, I. Ben Ayed, C. Schroers, and Y . Boykov. On regularized losses for weakly-supervised cnn segmentation. In Proceedings of the European Confer- ence on Computer Vision (ECCV), pages 507–522, 2018. 4

  51. [51]

    W. Tiller. Knot-removal algorithms for nurbs curves and sur- faces. Computer-Aided Design, 24(8):445 – 453, 1992. 3

  52. [52]

    Weston, F

    J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learn- ing via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012. 4

  53. [53]

    Z. Wu, C. Shen, and A. v. d. Hengel. Wider or deeper: Revis- iting the resnet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016. 2

  54. [54]

    F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In European Conference on Computer Vision (ECCV), pages 648–663. Springer, 2016. 2, 3

  55. [55]

    Multi-Scale Context Aggregation by Dilated Convolutions

    F. Yu and V . Koltun. Multi-scale context aggregation by di- lated convolutions. arXiv preprint arXiv:1511.07122, 2015. 2

  56. [56]

    H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. Icnet for real-time semantic segmentation on high-resolution images. arXiv preprint arXiv:1704.08545, 2017. 1

  57. [57]

    H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017. 2, 4, 5, 7 11