Efficient Segmentation: Learning Downsampling Near Semantic Boundaries

Dmitrii Marin; Fei Yang; Peter Vajda; Priyam Chatterjee; Sam Tsai; Yuri Boykov; Zijian He

arxiv: 1907.07156 · v1 · pith:O2BUURFMnew · submitted 2019-07-16 · 💻 cs.CV · cs.LG

Efficient Segmentation: Learning Downsampling Near Semantic Boundaries

Dmitrii Marin , Zijian He , Peter Vajda , Priyam Chatterjee , Sam Tsai , Fei Yang , Yuri Boykov This is my paper

Pith reviewed 2026-05-24 20:44 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords semantic segmentationadaptive downsamplingcontent-adaptive samplingboundary preservationefficient inferencesmall object segmentationcomputer vision

0 comments

The pith

A learned downsampling strategy prioritizes locations near semantic boundaries to improve segmentation accuracy and efficiency over uniform sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training a network to decide which image locations to keep when downsampling the input for semantic segmentation, with a bias toward positions close to class boundaries. Uniform downsampling often loses small objects and blurs edges, so the adaptive approach aims to preserve detail where it matters most while still reducing pixel count and compute. The authors report that this yields segmentation maps with sharper boundaries and better coverage of small objects, together with a superior accuracy-to-cost curve on standard benchmarks. A reader would care because many real-time systems must run segmentation quickly yet cannot afford to miss fine details or tiny instances. The central mechanism is therefore the learned sampler that replaces fixed-grid downsampling.

Core claim

The paper claims that a content-adaptive downsampling module, trained to favor sampling locations near semantic boundaries, produces segmentation output whose boundary quality and small-object support exceed those of uniform downsampling at the same computational budget.

What carries the argument

A learned content-adaptive downsampling module that predicts per-location sampling decisions biased toward semantic boundaries.

If this is right

Boundary pixels in the output segmentation receive higher fidelity because more samples are allocated there.
Small objects receive more reliable support because the sampler avoids skipping them.
The accuracy-efficiency frontier improves, so a target accuracy can be reached with fewer operations.
The same input resolution can deliver higher overall segmentation quality without increasing compute.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sampler could be inserted before other dense-prediction heads such as depth or instance segmentation.
In video streams the per-frame sampling decisions might be regularized across time to reduce flickering.
Hardware implementations could cache the learned sampling mask for repeated use on similar scenes.

Load-bearing premise

That the added sampling network can be trained stably and run at inference time without overhead that erases the intended efficiency gains.

What would settle it

A controlled experiment on a public segmentation dataset that measures mean IoU and FLOPs for the adaptive sampler versus uniform downsampling at multiple target resolutions; if the adaptive curve lies strictly above the uniform curve, the claim holds.

Figures

Figures reproduced from arXiv: 1907.07156 by Dmitrii Marin, Fei Yang, Peter Vajda, Priyam Chatterjee, Sam Tsai, Yuri Boykov, Zijian He.

**Figure 2.** Figure 2: Proposed efficient segmentation architecture with [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 4.** Figure 4: Architecture of non-uniform downsampling block [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 6.** Figure 6: Double U-Net model for predicting sampling parameters. The depth of the first sub-network can vary (depending [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: Examples from Cityscapes [10] val set. (a): original images and non-uniform 8 × 8 sampling tensor produced by our trained auxiliary net in [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗

**Figure 8.** Figure 8: Cost-performance analysis on ApolloScape dataset. Proposed method performs better than the baseline method. With the same cost we can achieve higher quality. PSP-Net backbone 0.25 0.35 0.45 0.55 0 10 20 30 40 50 mIoU cost, ·109 flops Proposed Baseline Deeplabv3+ backbone 0.50 0.55 0.60 0.65 15 20 25 30 35 mIoU cost, ·10⁹ flops Proposed Baseline [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗

**Figure 11.** Figure 11: Cost-performance analysis on Supervisely dataset. Our approach improves quality of segmentation. contrast, brightness and adding salt-and-pepper noise. 4.2. Cost-performance Analysis ApolloScape [27] is an open dataset for autonomous driving. The dataset consists of approximately 105K training and 8K validation images of size 3384×2710. The annotations contain 22 classes for evaluation. The annotations … view at source ↗

**Figure 9.** Figure 9: Cost-performance analysis on CityScapes with PSP-Net and Deeplabv3+ baselines for varying downsampling size, see Tab. 3. Our content-adaptive downsampling gives better results with the same computational cost. we set λ = 1. The auxiliary network predicts a sampling tensor of size (2, 8, 8), which is then resized to a required downsampling resolution. During training of the segmentation network we do not … view at source ↗

**Figure 12.** Figure 12: Average recall of objects broken down by object classes and sizes on the validation set of ApolloScapes. Values [PITH_FULL_IMAGE:figures/full_fig_p008_12.png] view at source ↗

**Figure 13.** Figure 13: Absolute accuracy difference between our ap [PITH_FULL_IMAGE:figures/full_fig_p008_13.png] view at source ↗

**Figure 16.** Figure 16: Illustration for the piece-wise linear approxima [PITH_FULL_IMAGE:figures/full_fig_p009_16.png] view at source ↗

read the original abstract

Many automated processes such as auto-piloting rely on a good semantic segmentation as a critical component. To speed up performance, it is common to downsample the input frame. However, this comes at the cost of missed small objects and reduced accuracy at semantic boundaries. To address this problem, we propose a new content-adaptive downsampling technique that learns to favor sampling locations near semantic boundaries of target classes. Cost-performance analysis shows that our method consistently outperforms the uniform sampling improving balance between accuracy and computational efficiency. Our adaptive sampling gives segmentation with better quality of boundaries and more reliable support for smaller-size objects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows a learned content-adaptive downsampling that favors semantic boundaries in segmentation, claiming better accuracy-efficiency trade-offs than uniform sampling.

read the letter

Hey, the main thing to know is that this work trains a downsampler to place more samples near semantic boundaries rather than using a uniform grid, and the abstract says this gives better boundary quality, small-object handling, and overall cost-performance than standard downsampling for real-time tasks like auto-piloting. It is a direct, practical extension of adaptive sampling ideas to a known pain point in efficient segmentation. The approach is empirical and the central claim is testable: the added sampling decisions improve the trade-off without obvious logical gaps. The stress-test note is right that no internal inconsistency shows up in the stated argument. The weakest assumption flagged in the reader report is also fair but ordinary for this kind of method paper. On the downside, the abstract supplies no training details, baselines, or quantitative numbers, so the size of the gains and any inference overhead from the sampler itself cannot be judged yet. That keeps the soundness assessment provisional until the full experiments are checked. The work is incremental rather than foundational, but the application is clear and the problem it targets is real. It would interest people building real-time segmentation pipelines who already care about downsampling choices. A serious referee should look at it to verify the experimental comparisons and overhead measurements. I would send it to peer review rather than desk reject.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a content-adaptive downsampling technique for semantic segmentation that learns to favor sampling locations near semantic boundaries of target classes. It claims that this approach consistently outperforms uniform sampling in cost-performance trade-offs, yielding improved boundary quality and more reliable support for smaller objects.

Significance. If the empirical results hold with appropriate controls, the method could improve real-time segmentation pipelines in applications such as autonomous driving by reducing compute while preserving accuracy at semantically important locations.

major comments (1)

[Abstract] Abstract: the claim that the method 'consistently outperforms the uniform sampling' is presented without any experimental details, baselines, error bars, training procedure, or quantitative results, so the central empirical claim cannot be evaluated.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the review and the opportunity to address this point. We respond to the major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the method 'consistently outperforms the uniform sampling' is presented without any experimental details, baselines, error bars, training procedure, or quantitative results, so the central empirical claim cannot be evaluated.

Authors: We agree that the abstract, by design, is a concise high-level summary and therefore omits the full experimental protocol, specific baselines, error bars, and quantitative tables. These details appear in Sections 4 (Experimental Setup) and 5 (Results), which include direct comparisons against uniform downsampling, multiple backbone architectures, Cityscapes and other datasets, mIoU and boundary F-score metrics, and training hyperparameters. The claim of consistent outperformance is derived from the cost-accuracy curves and per-class analyses shown in those sections. To make the abstract self-contained for readers who may only read the summary, we will revise it to include one or two key quantitative statements (e.g., typical mIoU gains at equivalent FLOPs) while remaining within length limits. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical method for content-adaptive downsampling in semantic segmentation, with claims resting on experimental cost-performance comparisons to uniform sampling. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content. The central claims are externally verifiable through accuracy-efficiency trade-offs and boundary quality metrics, making the work self-contained without reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no mathematical formulation, so no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5636 in / 918 out tokens · 19053 ms · 2026-05-24T20:44:58.290614+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 8 internal anchors

[1]

https://caffe2.ai

Caffe2: A New Lightweight, Modular, and Scalable Deep Learning Framework. https://caffe2.ai. 5

work page
[2]

https://docs.scipy.org/ doc/scipy/reference/generated/scipy

SciPy is open-source software for mathematics, science, and engineering. https://docs.scipy.org/ doc/scipy/reference/generated/scipy. interpolate.griddata.html. 4

work page
[3]

Agarwal, A

S. Agarwal, A. Awan, and D. Roth. Learning to detect ob- jects in images via a sparse, part-based representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 26(11):1475–1490, 2004. 5

work page 2004
[4]

Andrienko

O. Andrienko. ICNet and PSPNet-50 in Tensorﬂow for real- time semantic segmentation. https://github.com/ oandrienko/fast-semantic-segmentation,

work page
[5]

Badrinarayanan, A

V . Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for im- age segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(12):2481–2495, Dec

work page
[6]

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully con- nected crfs. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence (TPAMI), 40(4):834–848, 2018. 2

work page 2018
[7]

L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Re- thinking atrous convolution for semantic image segmenta- tion. arXiv preprint arXiv:1706.05587, 2017. 2

work page internal anchor Pith review Pith/arXiv arXiv 2017
[8]

L.-C. Chen, Y . Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for se- mantic image segmentation. pages 801–818, 2018. 2, 4, 5, 7

work page 2018
[9]

F. Chollet. Xception: Deep learning with depthwise sepa- rable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) , pages 1251–1258, 2017. 7

work page 2017
[10]

Cordts, M

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 5, 7

work page 2016
[11]

J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017. 3

work page 2017
[12]

Delaunay et al

B. Delaunay et al. Sur la sphere vide. Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk , 7(793- 800):1–2, 1934. 4

work page 1934
[13]

Everingham, L

M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal- network.org/challenges/VOC/voc2012/workshop/index.html. 5

work page 2012
[14]

Everingham, A

M. Everingham, A. Zisserman, C. K. Williams, L. Van Gool, M. Allan, C. M. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dork´o, et al. The 2005 pascal visual object classes chal- lenge. In Machine Learning Challenges. Evaluating Predic- tive Uncertainty, Visual Object Classiﬁcation, and Recognis- ing Tectual Entailment, pages 117–176. Springer, 2006. 5

work page 2005
[15]

Fei-Fei, R

L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories.Computer vision and Image understanding, 106(1):59–70, 2007. 5

work page 2007
[16]

Figurnov, M

M. Figurnov, M. D. Collins, Y . Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov. Spatially adaptive com- putation time for residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 1039–1048, 2017. 3

work page 2017
[17]

Figurnov, A

M. Figurnov, A. Ibraimova, D. P. Vetrov, and P. Kohli. Per- foratedcnns: Acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pages 947–955, 2016. 3

work page 2016
[18]

Girshick

R. Girshick. Fast r-cnn. In The IEEE International Confer- ence on Computer Vision (ICCV), December 2015. 2, 3

work page 2015
[19]

Girshick, J

R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea- ture hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014. 2, 3

work page 2014
[20]

Hariharan, P

B. Hariharan, P. Arbel ´aez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV). IEEE, 2011. 5

work page 2011
[21]

K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick. Mask r- cnn. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017. 2, 3

work page 2017
[22]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE confer- ence on computer vision and pattern recognition (CVPR) , pages 770–778, 2016. 2, 7

work page 2016
[23]

X. He, R. S. Zemel, and M. ´A. Carreira-Perpi˜n´an. Multiscale conditional random ﬁelds for image labeling. InProceedings of the 2004 IEEE computer society conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages II– II. IEEE, 2004. 2

work page 2004
[24]

Hernndez-Mederos and J

V . Hernndez-Mederos and J. Estrada-Sarlabous. Sampling points on regular parametric curves with control of their dis- tribution. Computer Aided Geometric Design , 20(6):363 – 382, 2003. 3

work page 2003
[25]

Holschneider, R

M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. Springer, 1990. 2

work page 1990
[26]

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efﬁ- cient convolutional neural networks for mobile vision appli- cations. arXiv preprint arXiv:1704.04861, 2017. 1

work page internal anchor Pith review Pith/arXiv arXiv 2017
[27]

The ApolloScape Open Dataset for Autonomous Driving and its Application

X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y . Lin, and R. Yang. The apolloscape dataset for autonomous driving. arXiv preprint arXiv:1803.06184, 2018. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2018
[28]

Jaderberg, K

M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015. 3

work page 2017
[29]

Jeon and J

Y . Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classiﬁcation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1846–1854. IEEE, 2017. 3

work page 2017
[30]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5

work page internal anchor Pith review Pith/arXiv arXiv 2014
[31]

Kohli, P

P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. International Journal of Com- puter Vision (IJCV), 82(3):302–324, 2009. 8

work page 2009
[32]

Krizhevsky, I

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classiﬁcation with deep convolutional neural networks. In Advances in neural information processing systems , pages 1097–1105, 2012. 2

work page 2012
[33]

LeCun, Y

Y . LeCun, Y . Bengio, et al. Convolutional networks for im- ages, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995. 2

work page 1995
[34]

X. Li, Z. Liu, P. Luo, C. Change Loy, and X. Tang. Not all pixels are equal: Difﬁculty-aware semantic segmentation via deep layer cascade. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) , pages 3193–3202, 2017. 3

work page 2017
[35]

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra- manan, P. Doll´ar, and C. L. Zitnick. Microsoft coco: Com- mon objects in context. In European Conference on Com- puter Vision (ECCV), pages 740–755. Springer, 2014. 5 10

work page 2014
[36]

J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 3431–3440, 2015. 2

work page 2015
[37]

Marin, M

D. Marin, M. Tang, I. B. Ayed, and Y . Boykov. Beyond gradient descent for regularized segmentation losses. In IEEE conference on Computer Vision and Pattern Recogni- tion (CVPR), 2019. 4

work page 2019
[38]

Newell, K

A. Newell, K. Yang, and J. Deng. Stacked hourglass net- works for human pose estimation. In European Confer- ence on Computer Vision (ECCV), pages 483–499. Springer,

work page
[39]

H. Noh, S. Hong, and B. Han. Learning deconvolution net- work for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1520–1528, 2015. 2

work page 2015
[40]

S. M. Obeidat and S. Raman. An intelligent sampling method for inspecting free-form surfaces. The International Journal of Advanced Manufacturing Technology , 40(11- 12):1125–1136, 2009. 3

work page 2009
[41]

Peter, S

P. Peter, S. Hoffmann, F. Nedwed, L. Hoeltgen, and J. We- ickert. From optimised inpainting with linear pdes towards competitive image compression codecs. In T. Br¨aunl, B. Mc- Cane, M. Rivera, and X. Yu, editors, Image and Video Tech- nology, pages 63–74, Cham, 2016. Springer International Publishing. 3, 9

work page 2016
[42]

Recasens, P

A. Recasens, P. Kellnhofer, S. Stent, W. Matusik, and A. Tor- ralba. Learning to zoom: a saliency-based sampling layer for neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 51–66, 2018. 3

work page 2018
[43]

S. R. Richter, Z. Hayder, and V . Koltun. Playing for bench- marks. In IEEE International Conference on Computer Vi- sion, ICCV 2017, Venice, Italy, October 22-29, 2017 , pages 2232–2241, 2017. 5

work page 2017
[44]

Ronneberger, P

O. Ronneberger, P. Fischer, and T. Brox. U-net: Convo- lutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention , pages 234–241. Springer,

work page
[45]

G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA Dataset: A Large Collection of Syn- thetic Images for Semantic Segmentation of Urban Scenes (SYNTHIA-Rand). In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. 5, 7

work page 2016
[46]

Russell, P

C. Russell, P. Kohli, P. H. Torr, et al. Associative hierarchi- cal crfs for object class image segmentation. In Computer Vision, 2009 IEEE 12th International Conference on , pages 739–746. IEEE, 2009. 2

work page 2009
[47]

Shirley, M

P. Shirley, M. Ashikhmin, and S. Marschner. Fundamentals of Computer Graphics. A. K. Peters, Ltd., Natick, MA, USA, 2nd edition, 2005. 4

work page 2005
[48]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 2

work page internal anchor Pith review Pith/arXiv arXiv 2014
[49]

Supervisely Person

Supervise.ly. Releasing “Supervisely Person” dataset for teaching machines to segment humans. https://hackernoon.com/releasing-supervisely-person- dataset-for-teaching-machines-to-segment-humans- 1f1fc1f28469, 2018. 5, 7

work page 2018
[50]

M. Tang, F. Perazzi, A. Djelouah, I. Ben Ayed, C. Schroers, and Y . Boykov. On regularized losses for weakly-supervised cnn segmentation. In Proceedings of the European Confer- ence on Computer Vision (ECCV), pages 507–522, 2018. 4

work page 2018
[51]

W. Tiller. Knot-removal algorithms for nurbs curves and sur- faces. Computer-Aided Design, 24(8):445 – 453, 1992. 3

work page 1992
[52]

Weston, F

J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learn- ing via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012. 4

work page 2012
[53]

Z. Wu, C. Shen, and A. v. d. Hengel. Wider or deeper: Revis- iting the resnet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016. 2

work page internal anchor Pith review Pith/arXiv arXiv 2016
[54]

F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In European Conference on Computer Vision (ECCV), pages 648–663. Springer, 2016. 2, 3

work page 2016
[55]

Multi-Scale Context Aggregation by Dilated Convolutions

F. Yu and V . Koltun. Multi-scale context aggregation by di- lated convolutions. arXiv preprint arXiv:1511.07122, 2015. 2

work page internal anchor Pith review Pith/arXiv arXiv 2015
[56]

H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. Icnet for real-time semantic segmentation on high-resolution images. arXiv preprint arXiv:1704.08545, 2017. 1

work page internal anchor Pith review Pith/arXiv arXiv 2017
[57]

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017. 2, 4, 5, 7 11

work page 2017

[1] [1]

https://caffe2.ai

Caffe2: A New Lightweight, Modular, and Scalable Deep Learning Framework. https://caffe2.ai. 5

work page

[2] [2]

https://docs.scipy.org/ doc/scipy/reference/generated/scipy

SciPy is open-source software for mathematics, science, and engineering. https://docs.scipy.org/ doc/scipy/reference/generated/scipy. interpolate.griddata.html. 4

work page

[3] [3]

Agarwal, A

S. Agarwal, A. Awan, and D. Roth. Learning to detect ob- jects in images via a sparse, part-based representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 26(11):1475–1490, 2004. 5

work page 2004

[4] [4]

Andrienko

O. Andrienko. ICNet and PSPNet-50 in Tensorﬂow for real- time semantic segmentation. https://github.com/ oandrienko/fast-semantic-segmentation,

work page

[5] [5]

Badrinarayanan, A

V . Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for im- age segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(12):2481–2495, Dec

work page

[6] [6]

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully con- nected crfs. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence (TPAMI), 40(4):834–848, 2018. 2

work page 2018

[7] [7]

L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Re- thinking atrous convolution for semantic image segmenta- tion. arXiv preprint arXiv:1706.05587, 2017. 2

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

L.-C. Chen, Y . Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for se- mantic image segmentation. pages 801–818, 2018. 2, 4, 5, 7

work page 2018

[9] [9]

F. Chollet. Xception: Deep learning with depthwise sepa- rable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) , pages 1251–1258, 2017. 7

work page 2017

[10] [10]

Cordts, M

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 5, 7

work page 2016

[11] [11]

J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017. 3

work page 2017

[12] [12]

Delaunay et al

B. Delaunay et al. Sur la sphere vide. Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk , 7(793- 800):1–2, 1934. 4

work page 1934

[13] [13]

Everingham, L

M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal- network.org/challenges/VOC/voc2012/workshop/index.html. 5

work page 2012

[14] [14]

Everingham, A

M. Everingham, A. Zisserman, C. K. Williams, L. Van Gool, M. Allan, C. M. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dork´o, et al. The 2005 pascal visual object classes chal- lenge. In Machine Learning Challenges. Evaluating Predic- tive Uncertainty, Visual Object Classiﬁcation, and Recognis- ing Tectual Entailment, pages 117–176. Springer, 2006. 5

work page 2005

[15] [15]

Fei-Fei, R

L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories.Computer vision and Image understanding, 106(1):59–70, 2007. 5

work page 2007

[16] [16]

Figurnov, M

M. Figurnov, M. D. Collins, Y . Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov. Spatially adaptive com- putation time for residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 1039–1048, 2017. 3

work page 2017

[17] [17]

Figurnov, A

M. Figurnov, A. Ibraimova, D. P. Vetrov, and P. Kohli. Per- foratedcnns: Acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pages 947–955, 2016. 3

work page 2016

[18] [18]

Girshick

R. Girshick. Fast r-cnn. In The IEEE International Confer- ence on Computer Vision (ICCV), December 2015. 2, 3

work page 2015

[19] [19]

Girshick, J

R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea- ture hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014. 2, 3

work page 2014

[20] [20]

Hariharan, P

B. Hariharan, P. Arbel ´aez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV). IEEE, 2011. 5

work page 2011

[21] [21]

K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick. Mask r- cnn. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017. 2, 3

work page 2017

[22] [22]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE confer- ence on computer vision and pattern recognition (CVPR) , pages 770–778, 2016. 2, 7

work page 2016

[23] [23]

X. He, R. S. Zemel, and M. ´A. Carreira-Perpi˜n´an. Multiscale conditional random ﬁelds for image labeling. InProceedings of the 2004 IEEE computer society conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages II– II. IEEE, 2004. 2

work page 2004

[24] [24]

Hernndez-Mederos and J

V . Hernndez-Mederos and J. Estrada-Sarlabous. Sampling points on regular parametric curves with control of their dis- tribution. Computer Aided Geometric Design , 20(6):363 – 382, 2003. 3

work page 2003

[25] [25]

Holschneider, R

M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. Springer, 1990. 2

work page 1990

[26] [26]

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efﬁ- cient convolutional neural networks for mobile vision appli- cations. arXiv preprint arXiv:1704.04861, 2017. 1

work page internal anchor Pith review Pith/arXiv arXiv 2017

[27] [27]

The ApolloScape Open Dataset for Autonomous Driving and its Application

X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y . Lin, and R. Yang. The apolloscape dataset for autonomous driving. arXiv preprint arXiv:1803.06184, 2018. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2018

[28] [28]

Jaderberg, K

M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015. 3

work page 2017

[29] [29]

Jeon and J

Y . Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classiﬁcation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1846–1854. IEEE, 2017. 3

work page 2017

[30] [30]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5

work page internal anchor Pith review Pith/arXiv arXiv 2014

[31] [31]

Kohli, P

P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. International Journal of Com- puter Vision (IJCV), 82(3):302–324, 2009. 8

work page 2009

[32] [32]

Krizhevsky, I

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classiﬁcation with deep convolutional neural networks. In Advances in neural information processing systems , pages 1097–1105, 2012. 2

work page 2012

[33] [33]

LeCun, Y

Y . LeCun, Y . Bengio, et al. Convolutional networks for im- ages, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995. 2

work page 1995

[34] [34]

X. Li, Z. Liu, P. Luo, C. Change Loy, and X. Tang. Not all pixels are equal: Difﬁculty-aware semantic segmentation via deep layer cascade. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) , pages 3193–3202, 2017. 3

work page 2017

[35] [35]

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra- manan, P. Doll´ar, and C. L. Zitnick. Microsoft coco: Com- mon objects in context. In European Conference on Com- puter Vision (ECCV), pages 740–755. Springer, 2014. 5 10

work page 2014

[36] [36]

J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 3431–3440, 2015. 2

work page 2015

[37] [37]

Marin, M

D. Marin, M. Tang, I. B. Ayed, and Y . Boykov. Beyond gradient descent for regularized segmentation losses. In IEEE conference on Computer Vision and Pattern Recogni- tion (CVPR), 2019. 4

work page 2019

[38] [38]

Newell, K

A. Newell, K. Yang, and J. Deng. Stacked hourglass net- works for human pose estimation. In European Confer- ence on Computer Vision (ECCV), pages 483–499. Springer,

work page

[39] [39]

H. Noh, S. Hong, and B. Han. Learning deconvolution net- work for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1520–1528, 2015. 2

work page 2015

[40] [40]

S. M. Obeidat and S. Raman. An intelligent sampling method for inspecting free-form surfaces. The International Journal of Advanced Manufacturing Technology , 40(11- 12):1125–1136, 2009. 3

work page 2009

[41] [41]

Peter, S

P. Peter, S. Hoffmann, F. Nedwed, L. Hoeltgen, and J. We- ickert. From optimised inpainting with linear pdes towards competitive image compression codecs. In T. Br¨aunl, B. Mc- Cane, M. Rivera, and X. Yu, editors, Image and Video Tech- nology, pages 63–74, Cham, 2016. Springer International Publishing. 3, 9

work page 2016

[42] [42]

Recasens, P

A. Recasens, P. Kellnhofer, S. Stent, W. Matusik, and A. Tor- ralba. Learning to zoom: a saliency-based sampling layer for neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 51–66, 2018. 3

work page 2018

[43] [43]

S. R. Richter, Z. Hayder, and V . Koltun. Playing for bench- marks. In IEEE International Conference on Computer Vi- sion, ICCV 2017, Venice, Italy, October 22-29, 2017 , pages 2232–2241, 2017. 5

work page 2017

[44] [44]

Ronneberger, P

O. Ronneberger, P. Fischer, and T. Brox. U-net: Convo- lutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention , pages 234–241. Springer,

work page

[45] [45]

G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA Dataset: A Large Collection of Syn- thetic Images for Semantic Segmentation of Urban Scenes (SYNTHIA-Rand). In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. 5, 7

work page 2016

[46] [46]

Russell, P

C. Russell, P. Kohli, P. H. Torr, et al. Associative hierarchi- cal crfs for object class image segmentation. In Computer Vision, 2009 IEEE 12th International Conference on , pages 739–746. IEEE, 2009. 2

work page 2009

[47] [47]

Shirley, M

P. Shirley, M. Ashikhmin, and S. Marschner. Fundamentals of Computer Graphics. A. K. Peters, Ltd., Natick, MA, USA, 2nd edition, 2005. 4

work page 2005

[48] [48]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 2

work page internal anchor Pith review Pith/arXiv arXiv 2014

[49] [49]

Supervisely Person

Supervise.ly. Releasing “Supervisely Person” dataset for teaching machines to segment humans. https://hackernoon.com/releasing-supervisely-person- dataset-for-teaching-machines-to-segment-humans- 1f1fc1f28469, 2018. 5, 7

work page 2018

[50] [50]

M. Tang, F. Perazzi, A. Djelouah, I. Ben Ayed, C. Schroers, and Y . Boykov. On regularized losses for weakly-supervised cnn segmentation. In Proceedings of the European Confer- ence on Computer Vision (ECCV), pages 507–522, 2018. 4

work page 2018

[51] [51]

W. Tiller. Knot-removal algorithms for nurbs curves and sur- faces. Computer-Aided Design, 24(8):445 – 453, 1992. 3

work page 1992

[52] [52]

Weston, F

J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learn- ing via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012. 4

work page 2012

[53] [53]

Z. Wu, C. Shen, and A. v. d. Hengel. Wider or deeper: Revis- iting the resnet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016. 2

work page internal anchor Pith review Pith/arXiv arXiv 2016

[54] [54]

F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In European Conference on Computer Vision (ECCV), pages 648–663. Springer, 2016. 2, 3

work page 2016

[55] [55]

Multi-Scale Context Aggregation by Dilated Convolutions

F. Yu and V . Koltun. Multi-scale context aggregation by di- lated convolutions. arXiv preprint arXiv:1511.07122, 2015. 2

work page internal anchor Pith review Pith/arXiv arXiv 2015

[56] [56]

H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. Icnet for real-time semantic segmentation on high-resolution images. arXiv preprint arXiv:1704.08545, 2017. 1

work page internal anchor Pith review Pith/arXiv arXiv 2017

[57] [57]

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017. 2, 4, 5, 7 11

work page 2017