Rethinking Atrous Convolution for Semantic Image Segmentation
Pith reviewed 2026-05-12 00:20 UTC · model grok-4.3
The pith
Atrous convolutions in cascaded or parallel modules plus global context enable accurate semantic segmentation without DenseCRF.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By employing atrous convolution in cascaded or parallel arrangements with several rates, and by augmenting the Atrous Spatial Pyramid Pooling module with image-level global-context features, the DeepLabv3 architecture captures multi-scale information directly inside the network. This yields significant accuracy gains over prior DeepLab models on PASCAL VOC 2012 without requiring DenseCRF post-processing, and matches the performance of contemporary state-of-the-art segmentation systems.
What carries the argument
Atrous Spatial Pyramid Pooling module augmented with image-level global features, together with cascaded or parallel atrous convolution blocks that apply multiple dilation rates.
If this is right
- Multi-scale context can be extracted inside a single forward pass by probing features at several atrous rates in parallel or cascade.
- Global image-level features can be fused with local convolutional features to improve scene layout understanding.
- DenseCRF post-processing is no longer required to reach high segmentation accuracy on PASCAL VOC 2012.
- The same modular atrous design can be inserted into other DCNN backbones to adapt their receptive fields without changing network depth.
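The parallel arrangement these points describe can be made concrete. The PyTorch sketch below (not the authors' released code; the rates, channel widths, and the final 1x1 projection are illustrative choices) shows an ASPP-style head: several 3x3 convolutions with different dilation rates run alongside a 1x1 branch and an image-level pooling branch, and their outputs are concatenated and projected.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of an ASPP head with an image-level branch.

    Rates (6, 12, 18) and 256 output channels are illustrative,
    not the paper's exact configuration at every output stride.
    """
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        # One 1x1 branch plus one 3x3 atrous branch per rate; with
        # padding == dilation, each branch preserves spatial size.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
             for r in rates])
        # Image-level branch: global average pool -> 1x1 conv -> upsample.
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False))
        # Fuse all parallel branches with a 1x1 projection.
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        img = F.interpolate(self.image_pool(x), size=(h, w),
                            mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [img], dim=1))

x = torch.randn(1, 512, 33, 33)
y = ASPP(512)(x)  # -> torch.Size([1, 256, 33, 33])
```

Because every branch preserves the input resolution, the whole module can be dropped on top of a backbone's final feature map in a single forward pass.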
Where Pith is reading between the lines
- The removal of the DenseCRF stage reduces both inference latency and memory use, which could support deployment on resource-limited devices.
- Because the modules operate at different scales inside the network, similar rate-scheduling ideas may transfer to other dense-prediction tasks such as depth estimation or instance segmentation.
- Sharing the exact training recipe allows direct ablation studies that isolate the effect of each atrous configuration on new datasets.
Load-bearing premise
The observed accuracy improvements are produced by the new atrous modules and global-context features rather than by any unstated changes in training schedule, data augmentation, or hyper-parameter choices.
What would settle it
Re-train the previous DeepLabv2 model with exactly the same training schedule, augmentation, and hyper-parameters used for DeepLabv3; if the mean intersection-over-union gap on the PASCAL VOC 2012 validation set closes or reverses, the contribution of the atrous modules is not established.
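The premise behind this test is that the two training recipes must differ only in architecture. A minimal sketch of that bookkeeping (the config values here are hypothetical placeholders, not the papers' exact settings) simply diffs the two recipes before any mIoU comparison is attributed to the modules:

```python
# Hypothetical training configs for the two models; any key outside
# "arch" that appears in the diff is a potential confounder.
v2 = {"lr_policy": "poly", "base_lr": 0.007, "crop": 321, "arch": "deeplabv2"}
v3 = {"lr_policy": "poly", "base_lr": 0.007, "crop": 513, "arch": "deeplabv3"}

diff = {k for k in v2 if v2[k] != v3[k]}
confounders = diff - {"arch"}  # non-architectural differences to equalize
```

With these placeholder values, `confounders` would flag the crop size, which is exactly the kind of optimization difference the settling experiment must rule out.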
Original abstract
In this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance. We also elaborate on implementation details and share our experience on training our system. The proposed `DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
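For readers unfamiliar with the operation the abstract revisits, the NumPy sketch below (the helper `atrous_conv2d` is an illustration, not library code) shows how a dilation rate enlarges a filter's field-of-view without adding parameters or reducing the resolution of the response map:

```python
import numpy as np

def atrous_conv2d(x, w, rate):
    """Single-channel 2D atrous (dilated) convolution with 'same' padding.

    A k x k kernel at dilation `rate` covers an effective field-of-view
    of k + (k - 1) * (rate - 1) pixels per side: same parameter count,
    same output resolution, larger receptive field.
    """
    k = w.shape[0]
    eff = k + (k - 1) * (rate - 1)        # effective kernel extent
    pad = eff // 2
    xp = np.pad(x, pad, mode="constant")
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            # Sample the input with gaps of size `rate` between taps.
            patch = xp[i:i + eff:rate, j:j + eff:rate]
            out[i, j] = (patch * w).sum()
    return out

x = np.arange(49, dtype=float).reshape(7, 7)
w = np.ones((3, 3)) / 9.0
y1 = atrous_conv2d(x, w, rate=1)  # standard 3x3 convolution
y2 = atrous_conv2d(x, w, rate=2)  # same 9 weights, 5x5 field-of-view
```

Both outputs have the input's 7x7 shape; only the spacing of the sampled taps changes, which is what lets a network widen its receptive field without extra depth or downsampling.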
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper revisits atrous convolution for semantic image segmentation, proposing modules that apply atrous convolutions either in cascade or in parallel to capture multi-scale context, and augments the Atrous Spatial Pyramid Pooling (ASPP) module with image-level features to encode global context. It further details training practices and claims that the resulting DeepLabv3 system significantly improves over prior DeepLab versions (without DenseCRF) while attaining performance comparable to other state-of-the-art models on the PASCAL VOC 2012 benchmark.
Significance. If the reported gains are shown to stem from the architectural innovations rather than uncontrolled training variables, the work offers a practical and incremental advance in multi-scale context modeling for dense prediction tasks, building directly on prior ASPP designs with clear implementation guidance that could influence follow-on architectures.
Major comments (2)
- [Experiments and abstract] Experiments section (and comparisons to prior DeepLab versions): the central claim that performance gains arise from the cascade/parallel atrous modules and image-level feature augmentation requires explicit confirmation that training schedules, data augmentation, and hyper-parameters are identical to those used in the authors' previous DeepLab papers; the abstract's reference to sharing 'implementation details and experience on training our system' does not substitute for matched baselines, leaving open the possibility that gains are confounded by optimization differences.
- [§3.2–3.3] §3.2–3.3 (atrous modules and ASPP augmentation): the manuscript should provide controlled ablations that isolate the incremental benefit of the new cascade/parallel atrous rates and the image-level feature branch while holding all training variables fixed, as the current presentation does not fully rule out that observed mIoU lifts are driven by the global context addition alone or by unstated hyper-parameter changes.
Minor comments (1)
- [Abstract] Abstract: the claims of 'significant improvement' and 'comparable performance' would be more informative if accompanied by the specific mIoU numbers and table references that appear later in the manuscript.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will update the paper to provide the requested clarifications and additional controlled experiments.
Point-by-point responses
- Referee: [Experiments and abstract] Experiments section (and comparisons to prior DeepLab versions): the central claim that performance gains arise from the cascade/parallel atrous modules and image-level feature augmentation requires explicit confirmation that training schedules, data augmentation, and hyper-parameters are identical to those used in the authors' previous DeepLab papers; the abstract's reference to sharing 'implementation details and experience on training our system' does not substitute for matched baselines, leaving open the possibility that gains are confounded by optimization differences.
Authors: We agree that explicit confirmation of matched training variables is necessary to attribute gains to the architectural changes. In the revised manuscript we will add a dedicated paragraph and table in the Experiments section that directly compares the training schedule (poly learning-rate policy, iteration count, crop size, batch size), data augmentation (random scaling, horizontal flipping, color jitter), and hyper-parameters used in DeepLabv3 to those reported in our prior DeepLabv2 work, noting only the architecture-specific modifications. This will make clear that the core optimization settings remain identical. revision: yes
- Referee: [§3.2–3.3] §3.2–3.3 (atrous modules and ASPP augmentation): the manuscript should provide controlled ablations that isolate the incremental benefit of the new cascade/parallel atrous rates and the image-level feature branch while holding all training variables fixed, as the current presentation does not fully rule out that observed mIoU lifts are driven by the global context addition alone or by unstated hyper-parameter changes.
Authors: We appreciate the request for stricter isolation of each component. While the current manuscript already reports ablation results for different atrous rates in cascade and parallel settings as well as the addition of the image-level feature branch, we acknowledge that these experiments could be presented more explicitly as incremental, fixed-training-protocol studies. In the revision we will insert a new table that starts from a common baseline and successively adds the cascade module, the parallel module, and the image-level features, reporting mIoU on the PASCAL VOC 2012 validation set under identical training settings for each step. revision: yes
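The "poly" learning-rate policy cited in the first response can be sketched in a few lines (the base rate and iteration count below are illustrative values, not necessarily the paper's exact settings):

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """'Poly' learning-rate policy used in the DeepLab training recipes:
    the rate decays from base_lr toward 0 as (1 - step/max_steps)**power."""
    return base_lr * (1.0 - step / max_steps) ** power

# Illustrative schedule: decays monotonically from 0.007 at step 0
# down to 0.0 at the final step.
lrs = [poly_lr(0.007, s, 30000) for s in (0, 15000, 30000)]
```

Holding this schedule (and the other listed settings) fixed across DeepLabv2 and DeepLabv3 runs is what would make the comparison table the authors promise interpretable.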
Circularity Check
No circularity: empirical benchmark evaluation with independent results
Full rationale
The paper proposes architectural changes (cascaded/parallel atrous convolutions and augmented ASPP) and reports empirical mIoU improvements on PASCAL VOC 2012. There is no derivation chain, no equations, and no 'predictions' that reduce by construction to fitted parameters or self-defined inputs. Benchmark numbers are externally falsifiable and not derived from the authors' prior fits. Self-references to earlier DeepLab versions are normal citations of prior empirical work and do not load-bear any mathematical reduction.
Axiom & Free-Parameter Ledger
Free parameters (2)
- atrous rates
- training hyper-parameters
Axioms (1)
- Domain assumption: Deep convolutional networks can be trained end-to-end on labeled segmentation data to produce per-pixel predictions.
Forward citations
Cited by 30 Pith papers
- Noise2Map: End-to-End Diffusion Model for Semantic Segmentation and Change Detection. Noise2Map repurposes diffusion model denoising into a direct predictor for semantic segmentation and change detection tasks in remote sensing, achieving top average ranks on benchmark datasets.
- VitaminP: cross-modal learning enables whole-cell segmentation from routine histology. VitaminP uses paired H&E-mIF data to train a model that transfers molecular boundary information, enabling accurate whole-cell segmentation directly from routine H&E histology across 34 cancer types.
- Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading. Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
- Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline. OVRSISBenchV2 is a realistic benchmark expanding scene and category coverage for open-vocabulary remote sensing segmentation, with Pi-Seg baseline showing strong transfer via positive-incentive noise perturbations.
- VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation. VGGT-Segmentor achieves new state-of-the-art cross-view segmentation on Ego-Exo4D with 67.7% and 68.0% average IoU using a geometry-enhanced model and correspondence-free pretraining that beats most supervised baselines.
- Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework. A hierarchical prompt tree with self-reflection graph propagation enables positive forward and backward knowledge transfer in incremental surgical instrument segmentation, improving over baselines by more than 5% and ...
- AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection. AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.
- Nano-U: Efficient Terrain Segmentation for Tiny Robot Navigation. A compact network called Nano-U trained with quantization-aware distillation enables accurate binary terrain segmentation and runs efficiently on ESP32-S3 microcontrollers for tiny robots.
- UnGAP: Uncertainty-Guided Affine Prompting for Real-Time Crack Segmentation. UnGAP turns aleatoric uncertainty into an active calibration signal via affine feature modulation to fix gradient suppression in heteroscedastic crack segmentation while maintaining real-time performance.
- Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy. RGSUD achieves SOTA unsupervised deraining by using IQA-based reward recycling and self-reinforcement to constrain optimization and improve pseudo-paired data quality.
- DOT-Sim: Differentiable Optical Tactile Simulation with Precise Real-to-Sim Physical Calibration. DOT-Sim uses MPM physics plus learned residual optics to simulate deformable tactile sensors, supporting zero-shot sim-to-real transfer for classification and control tasks.
- Diffusion Model as a Generalist Segmentation Learner. DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
- FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment. FryNet combines RGB and thermal imaging with adversarial regularization to segment oil areas, classify usability, and predict oxidation levels like PV and Totox with high accuracy on video data.
- Lorentz Framework for Semantic Segmentation. A Lorentz-model hyperbolic framework for semantic segmentation that integrates with Euclidean networks, provides free uncertainty maps, and is validated on ADE20K, COCO-Stuff, Pascal-VOC and Cityscapes using DeepLabV3...
- From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation. Petro-SAM adapts SAM via a Merge Block for polarized views plus multi-scale fusion and color-entropy priors to jointly achieve grain-edge and lithology segmentation in petrographic images.
- AIBuildAI: An AI Agent for Automatically Building AI Models. AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
- GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality. GTPBD-MM is the first multimodal benchmark for global terraced parcel extraction, integrating image, text, and DEM data with experiments showing that textual and terrain cues improve delineation accuracy over image-on...
- Evaluation of Randomization through Style Transfer for Enhanced Domain Generalization. A large pool of diverse artistic styles for style-transfer augmentation improves domain generalization in driving vision models more than repeated use of few styles or domain-matched styles, yielding the lightweight S...
- CrossWeaver: Cross-modal Weaving for Arbitrary-Modality Semantic Segmentation. CrossWeaver introduces MIB and SAF modules to enable flexible, reliability-aware cross-modal interaction and fusion, achieving SOTA multimodal semantic segmentation with minimal parameters and generalization to unseen...
- Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation. DGM-Net reaches 82.3% mIoU on Cityscapes and 45.24% on ADE20K using directional geometric guidance inside a linear-complexity Mamba backbone, without heavy pretraining or large models.
- AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos. AutoAWG generates controllable adverse weather automotive videos via semantics-guided adaptive multi-control fusion and vanishing-point-anchored temporal synthesis from static images, reducing FID by 50% and FVD by 16...
- DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation. DeltaSeg, a tiered-attention U-Net variant with a novel Deep Delta Attention module, outperforms 12 prior models on two multi-class structural defect segmentation benchmarks.
- WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms. WILD-SAM is a fine-tuned SAM variant using phase-aware MoE adapters and wavelet subband enhancement that achieves state-of-the-art landslide detection on wrapped InSAR data.
- HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation. HQF-Net reports mIoU gains on three remote-sensing benchmarks by adding quantum circuits to skip connections and a mixture-of-experts bottleneck inside a classical U-Net fused with a DINOv3 backbone.
- FoR-Net: Learning to Focus on Hard Regions for Efficient Semantic Segmentation. FoR-Net improves efficiency in semantic segmentation by focusing on hard regions with a learned selector and multi-scale convolutions, achieving competitive results on Cityscapes.
- An End-to-End Decision-Aware Multi-Scale Attention-Based Model for Explainable Autonomous Driving. A decision-aware multi-scale attention network generates tailored explanations for autonomous driving choices and outperforms prior models on F1 and a new Joint F1 metric across two datasets.
- A Benchmark Study of Segmentation Models and Adaptation Strategies for Landslide Detection from Satellite Imagery. Transformer-based models deliver strong landslide segmentation on satellite images, and parameter-efficient fine-tuning matches full fine-tuning accuracy while cutting trainable parameters by up to 95%.
- UA-Net: Uncertainty-Aware Network for TRISO Image Semantic Segmentation. UA-Net segments TRISO fuel micrographs into five regions with 95.5% mIoU and 97.3% mP on 102 test images, while its meta-model detects misclassifications at 91.8% specificity and 93.5% sensitivity.
- EDFNet: Early Fusion of Edge and Depth for Thin-Obstacle Segmentation in UAV Navigation. Early RGB-Depth-Edge fusion in EDFNet provides a competitive baseline for thin-obstacle segmentation on the DDOS dataset, with the best pretrained U-Net model reaching 0.244 Thin-Structure Evaluation Score.
- ResAF-Net: An Anchor-Free Attention-Based Network for Tree Detection and Agricultural Mapping in Palestine. ResAF-Net detects trees in satellite imagery for Palestinian agriculture, reaching 82% recall and 63% mAP@0.5 on the MillionTrees validation set and deployed in a web GIS.
Reference graph
Works this paper leans on
- [1] M. Abadi, A. Agarwal, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
- [2] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. In Eurographics, 2010.
- [3] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015.
- [4] A. Brandt. Multi-level adaptive solutions to boundary-value problems. Mathematics of Computation, 31(138):333–390, 1977.
- [5] W. L. Briggs, V. E. Henson, and S. F. McCormick. A Multigrid Tutorial. SIAM, 2000.
- [6]
- [7]
- [8] S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. arXiv:1603.08358, 2016.
- [9] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform. In CVPR, 2016.
- [10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
- [11]
- [12] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
- [13]
- [14]
- [15]
- [16] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
- [17]
- [18]
- [19] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. arXiv:1411.4734, 2014.
- [20] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 2014.
- [21]
- [22] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 2013.
- [23]
- [24]
- [25] G. Ghiasi and C. C. Fowlkes. Laplacian reconstruction and refinement for semantic segmentation. arXiv:1605.02264, 2016.
- [26]
- [27]
- [28] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, 2005.
- [29] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
- [30] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
- [31] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
- [32] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
- [33] X. He, R. S. Zemel, and M. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, 2004.
- [34]
- [35] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [36] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space, pages 289–297, 1989.
- [37]
- [38] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
- [39] M. A. Islam, M. Rochan, N. D. Bruce, and Y. Wang. Gated feedback refinement network for dense image labeling. In CVPR, 2017.
- [40] S. D. Jain, B. Xiong, and K. Grauman. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In CVPR, 2017.
- [41] V. Jampani, M. Kiefel, and P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense CRFs and bilateral neural networks. In CVPR, 2016.
- [42] X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y. Chen, J. Dong, L. Liu, Z. Jie, J. Feng, and S. Yan. Video scene parsing with predictive feature learning. In ICCV, 2017.
- [43]
- [44] S. Kong and C. Fowlkes. Recurrent scene parsing with perspective understanding in the loop. arXiv:1705.07238, 2017.
- [45] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
- [46]
- [47] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- [48] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.
- [49] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
- [50]
- [51]
- [52]
- [53]
- [54]
- [55]
- [56]
- [57] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [58]
- [59] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
- [60] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- [61] P. Luo, G. Wang, L. Lin, and X. Wang. Deep dual learning for semantic image segmentation. In ICCV, 2017.
- [62] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015.
- [63] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
- [64] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
- [65] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV, 2015.
- [66] G. Papandreou, I. Kokkinos, and P.-A. Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In CVPR, 2015.
- [67] G. Papandreou and P. Maragos. Multigrid geometric active contour models. TIP, 16(1):229–240, 2007.
- [68]
- [69] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
- [70]
- [71] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
- [72] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
- [73]
- [74] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.
- [75] F. Shen, R. Gan, S. Yan, and G. Zeng. Semantic segmentation via structured patch prediction, context CRF and guidance CRF. In CVPR, 2017.
- [76] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 2009.
- [77] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016.
- [78] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [79] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
- [80]