arxiv: 1906.09826 · v1 · pith:NULKT44Ynew · submitted 2019-06-24 · 💻 cs.CV

ESNet: An Efficient Symmetric Network for Real-time Semantic Segmentation

Pith reviewed 2026-05-25 17:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords real-time semantic segmentation · efficient CNN · Cityscapes · factorized convolutions · symmetric network · deep learning · semantic segmentation

The pith

ESNet's symmetric design of factorized and parallel factorized convolution units enables real-time semantic segmentation with roughly 1.6 million parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ESNet as a response to the high computational cost of semantic segmentation networks, building a nearly symmetric architecture around factorized convolution units (FCUs). These units use 1D factorized convolutions inside residual layers, while the parallel versions (PFCUs) split into branches that apply dilated convolutions at different rates before merging. If the design works as claimed, accurate segmentation can run at over 62 frames per second on a single GTX 1080Ti GPU with roughly 1.6 million parameters, which would make it a candidate for embedded or real-time systems. The Cityscapes experiments are meant to show that this architecture improves the speed-accuracy frontier relative to prior real-time methods.

Core claim

ESNet consists of a series of factorized convolution units and parallel factorized convolution units that together form a symmetric network. This design achieves state-of-the-art results in the speed and accuracy trade-off for real-time semantic segmentation on the Cityscapes dataset, with the model having nearly 1.6 million parameters and running at over 62 FPS on a GTX 1080Ti GPU.
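To make the throughput figure concrete, the short sketch below converts a frame rate into a per-frame time budget. The 62 FPS value is the one reported in the abstract; the function name `frame_budget_ms` is a hypothetical helper introduced here for illustration only.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# 62 FPS is the lower bound reported in the abstract (single GTX 1080Ti).
budget = frame_budget_ms(62)
print(f"{budget:.2f} ms per frame")  # 16.13 ms per frame
```

In other words, the reported speed implies that one full forward pass completes in roughly 16 ms on the stated hardware.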

What carries the argument

The factorized convolution unit (FCU) and parallel factorized convolution unit (PFCU), where PFCU uses a transform-split-transform-merge strategy with dilated convolutions.
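A rough way to see why 1D factorization saves parameters is to count weights. The sketch below compares one standard 3x3 convolution with a 3x1 convolution followed by a 1x3 convolution, both keeping the channel count fixed. The channel width of 64 is an assumption chosen for illustration, not a value taken from the paper, and biases and normalization layers are ignored.

```python
def conv2d_weights(c_in: int, c_out: int, k_h: int, k_w: int) -> int:
    """Number of weights in a 2D convolution, ignoring biases."""
    return c_in * c_out * k_h * k_w


def standard_3x3(channels: int) -> int:
    """Weights in a single 3x3 convolution with equal input and output channels."""
    return conv2d_weights(channels, channels, 3, 3)


def factorized_3x3(channels: int) -> int:
    """Weights in a 3x1 convolution followed by a 1x3 convolution."""
    return (conv2d_weights(channels, channels, 3, 1)
            + conv2d_weights(channels, channels, 1, 3))


channels = 64  # hypothetical width, for illustration only
full = standard_3x3(channels)
fact = factorized_3x3(channels)
print(full, fact)                       # 36864 24576
print(f"saving: {1 - fact / full:.0%}")  # saving: 33%
```

For a 3x3 kernel the factorized pair always uses two thirds of the weights of the full kernel (6C² versus 9C²), regardless of the channel width C.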

If this is right

  • The low parameter count supports deployment in resource-constrained environments.
  • Real-time performance above 60 FPS enables applications in video analysis and autonomous systems.
  • The symmetric structure maintains segmentation accuracy without excessive computation.
  • Results on Cityscapes suggest the design competes favorably with existing real-time segmentation models.
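The deployment point can be illustrated with simple arithmetic on the parameter count. The sketch below estimates weight storage only; the 1.6 million figure comes from the abstract, while the storage precisions (4 bytes for float32, 2 bytes for float16) are assumptions, and activation memory at inference time is not included.

```python
def weight_storage_mb(num_params: int, bytes_per_param: int) -> float:
    """Storage needed for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


NUM_PARAMS = 1_600_000  # "nearly 1.6M" per the abstract

print(weight_storage_mb(NUM_PARAMS, 4))  # float32 weights: 6.4
print(weight_storage_mb(NUM_PARAMS, 2))  # float16 weights: 3.2
```

This is a lower bound on the memory footprint, since feature maps for high-resolution inputs typically occupy more memory than the weights themselves.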

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the symmetric factorized approach to other tasks like instance segmentation could yield similar efficiency gains.
  • Evaluating the model on diverse hardware platforms would clarify its portability beyond the reported GTX 1080Ti setup.
  • The use of dilated convolutions in parallel branches may offer insights for receptive field design in other efficient architectures.

Load-bearing premise

The assumption that performance on Cityscapes validation and test sets with one hardware setup sufficiently demonstrates superiority for real-time semantic segmentation in general.

What would settle it

Demonstrating on another dataset or hardware that an alternative network achieves a superior combination of accuracy and frames per second.
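One way to make "a superior combination of accuracy and frames per second" precise is Pareto dominance: a model dominates another if it is at least as good on both axes and strictly better on one. The sketch below is a minimal illustration; the `Result` type, the model names, and all numbers are hypothetical and are not results from the paper.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    miou: float  # accuracy, higher is better
    fps: float   # speed, higher is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both axes and strictly better on at least one."""
    no_worse = a.miou >= b.miou and a.fps >= b.fps
    strictly_better = a.miou > b.miou or a.fps > b.fps
    return no_worse and strictly_better


# Hypothetical numbers, for illustration only.
model_a = Result("A", miou=70.0, fps=60.0)
model_b = Result("B", miou=68.0, fps=55.0)
model_c = Result("C", miou=72.0, fps=40.0)

print(dominates(model_a, model_b))  # True: A is better on both axes
print(dominates(model_a, model_c))  # False: C is more accurate
print(dominates(model_c, model_a))  # False: A is faster
```

Under this reading, the claim would be settled against ESNet only by a model that dominates it when both are measured on the same dataset and hardware.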

Figures

Figures reproduced from arXiv: 1906.09826 by Quan Zhou, Xiaofu Wu, Yu Wang.

Figure 1: Overall symmetric architecture of the proposed ESNet. The entire network is composed by four components: down-sampling unit, upsampling unit, factorized convolution unit and its parallel version. (Best viewed in color)
Figure 2: Comparison of different residual layer modules. From left to right are (a) Non-bottleneck [2], (b) Bottleneck [17], (c) Non-bottleneck-1D [19], (d) FCU and (e) PFCU module. “DConv” denotes the dilated convolution, where r1, r2, and r3 are dilated rates for each split branch, respectively.
Figure 3: The visual comparison on CityScapes val dataset. From left to right are input images, ground truth, segmentation outputs from our ESNet, SegNet [8], ENet [17], ERFNet [19], ESPNet [18], ICNet [35], and CGNet [15]. (Best viewed in color)

Original abstract

The recent years have witnessed great advances for semantic segmentation using deep convolutional neural networks (DCNNs). However, a large number of convolutional layers and feature channels lead to semantic segmentation as a computationally heavy task, which is disadvantage to the scenario with limited resources. In this paper, we design an efficient symmetric network, called (ESNet), to address this problem. The whole network has nearly symmetric architecture, which is mainly composed of a series of factorized convolution unit (FCU) and its parallel counterparts (PFCU). On one hand, the FCU adopts a widely-used 1D factorized convolution in residual layers. On the other hand, the parallel version employs a transform-split-transform-merge strategy in the designment of residual module, where the split branch adopts dilated convolutions with different rate to enlarge receptive field. Our model has nearly 1.6M parameters, and is able to be performed over 62 FPS on a single GTX 1080Ti GPU. The experiments demonstrate that our approach achieves state-of-the-art results in terms of speed and accuracy trade-off for real-time semantic segmentation on CityScapes dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes ESNet, a symmetric encoder-decoder network for real-time semantic segmentation. The architecture is built from factorized convolution units (FCU) that apply 1D factorized convolutions within residual blocks and parallel factorized convolution units (PFCU) that follow a transform-split-transform-merge pattern using dilated convolutions at multiple rates to expand receptive fields. The authors state that the resulting model contains approximately 1.6 million parameters and achieves more than 62 FPS on a GTX 1080Ti GPU while attaining state-of-the-art speed-accuracy trade-off on the Cityscapes dataset.

Significance. If the reported Cityscapes results hold, the work supplies a concrete, low-parameter architecture that improves the speed-accuracy frontier for real-time segmentation. The explicit parameter count and single-GPU FPS figure, together with the modular FCU/PFCU design, constitute a reproducible engineering contribution that can be directly compared against other lightweight segmentation models.

minor comments (2)
  1. [Abstract] The claim of 'state-of-the-art results' is not accompanied by any numerical accuracy metric (e.g., mIoU); adding the key quantitative numbers would allow readers to evaluate the trade-off immediately.
  2. [Abstract] The description of the PFCU 'transform-split-transform-merge' strategy is terse; a short diagram or one-sentence statement of how the split branches are recombined before the residual addition would remove ambiguity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of ESNet, the recognition of its engineering contribution, and the recommendation for minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical architecture proposal describing FCU/PFCU residual modules for semantic segmentation. No equations, fitted parameters, predictions, or derivation chains appear in the provided text. Claims rest on external experimental benchmarks (Cityscapes FPS/accuracy) rather than any self-referential reduction. Self-citations, if present, are not load-bearing for any central result.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The design rests on standard CNN assumptions (residual connections aid optimization, dilated convolutions enlarge receptive field without extra parameters) and on empirical choices of dilation rates and split ratios that are not quantified in the abstract.
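The statement that dilation enlarges the receptive field without extra parameters follows from the standard effective-kernel-size formula, k_eff = k + (k - 1)(r - 1), for kernel size k and dilation rate r. The sketch below evaluates it for a 3-tap kernel; the dilation rates used are hypothetical examples, since the exact values are not stated in the abstract.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel along one axis."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be >= 1")
    return kernel_size + (kernel_size - 1) * (dilation - 1)


KERNEL_SIZE = 3  # each 1D kernel still has exactly 3 weights per channel pair

# Hypothetical dilation rates, for illustration only.
for rate in (1, 2, 5, 9):
    print(rate, effective_kernel_size(KERNEL_SIZE, rate))
# 1 3
# 2 5
# 5 11
# 9 19
```

The number of weights depends only on the kernel size, so each branch covers a wider span at the same parameter cost; what the formula does not capture is any effect on runtime or accuracy, which remain empirical questions.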

free parameters (2)
  • dilation rates in PFCU
    Multiple dilation rates are chosen to enlarge receptive field; exact values and selection method are not stated.
  • channel widths and block counts
    Architecture hyperparameters that determine the 1.6 M parameter count are not listed.
axioms (2)
  • domain assumption: Factorized 1D convolutions preserve sufficient representational power for segmentation
    Invoked when the FCU is presented as a drop-in replacement for standard convolutions.
  • domain assumption: Transform-split-transform-merge with parallel dilated branches improves accuracy without harming speed
    Core justification for the PFCU design.



Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 9 internal anchors

  [1] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS. (2012) 1097–1105
  [2] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016) 770–778
  [3] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR. (2014) 580–587
  [4] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE TPAMI 39 (2017) 640–651
  [5] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI 40 (2018) 834–848
  [6] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.Y.: Pyramid scene parsing network. In: CVPR. (2016) 6230–6239
  [7] Xiaoxiao, L., Zhiwei, L., Ping, L., Chenchange, L., Xiaoou, T.: Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In: CVPR. (2017) 6459–6468
  [8] Badrinarayanan, V., Alex, K., Roberto, C.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015)
  [9] Guosheng, L., Anton, M., Chunhua, S., Reid, I.: Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In: CVPR. (2017) 5168–5177
  [10] Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: ICCV. (2015) 1520–1528
  [11] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015) 1–9
  [12] Peng, C., Xiangyu, Z., Gang, Y., Guiming, L., Jian, S.: Large kernel matters: Improve semantic segmentation by global convolutional network. In: CVPR. (2017) 1743–1751
  [13] Lin, G.S., Shen, C.H., Van, D.H., Reid, I.: Exploring context with deep structured models for semantic segmentation. IEEE TPAMI 40 (2018) 1352–1366
  [14] Cong, D., Zhou, Q., Chen, J., Wu, X., Zhang, S., Ou, W., Lu, H.: Can: Contextual aggregating network for semantic segmentation. In: ICASSP. (2019) accepted
  [15] Wu, T.Y., Tang, S., Zhang, R., Zhang, Y.D.: Cgnet: A light-weight context guided network for semantic segmentation. In: arXiv preprint arXiv:1811.08201v1. (2018)
  [16] Treml, M., Arjona-Medina, J., Mayr, A., Heusel, M., Widrich, M., Bodenhofer, U., Nessler, B., Hochreiter, S.: Speeding up semantic segmentation for autonomous driving. In: NIPS Workshop. (2016) 1–7
  [17] Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: Enet: A deep neural network architecture for real-time semantic segmentation. In: arXiv preprint arXiv:1606.02147. (2016)
  [18] Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H.: Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In: arXiv preprint arXiv:1803.06815v3. (2018)
  [19] Romera, E., Alvarez, J.M., Bergasa, L.M., Arroyo, R.: Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE TITS 19 (2018) 263–272
  [20] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR. (2016) 3213–3223
  [21] Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE TPAMI 35 (2013) 1915–1929
  [22] Panqu, W., Pengfei, C., Ye, Y., Ding, L., Zehua, H., Xiaodi, H., Cottrell, G.: Understanding convolution for semantic segmentation. In: WACV. (2018) 1451–1460
  [23] Liang-Chieh, C., George, P., F., S., H., A.: Rethinking atrous convolution for semantic image segmentation. In: arXiv:1706.05587. (2017)
  [24] Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
  [25] Ronneberger, O., Philipp, F., Thomas, B.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI. (2015) 225–233
  [26] Pohlen, T., Hermans, A., Mathias, M., Leibe, B.: Full-resolution residual networks for semantic segmentation in street scenes. In: CVPR. (2017) 3309–3318
  [27] Islam, M.A., Rochan, M., Bruce, N.D.B., Wang, Y.: Gated feedback refinement network for dense image labeling. In: CVPR. (2017) 4877–4885
  [28] Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111 (2015) 98–136
  [29] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., W.Wang, Weyand, T., Andreetto, M., Adam, H.: Mobilenets: efficient convolutional neural networks for mobile vision applications. In: arXiv preprint arXiv:1704.04861. (2017)
  [30] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classification using binary convolutional neural networks. In: ECCV. (2016)
  [31] Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: CVPR. (2018) 6848–6856
  [32] Wu, J., Leng, C., Wang, Y., Hu, Q., Cheng, J.: Quantized convolutional neural networks for mobile devices. In: CVPR. (2016) 5168–5177
  [33] Xie, X., Girshick, R., Dollar, P., Tu, Z.W., He, K.M.: Aggregated residual transformations for deep neural networks. In: CVPR. (2017) 5987–5995
  [34] Changqian, Y., Jingbo, W., Chao, P., Changxin, G., Gang, Y., Nong, S.: Bisenet: Bilateral segmentation network for real-time semantic segmentation. In: arXiv preprint arXiv:1808.00897. (2018)
  [35] Zhao, H.S., Qi, X.J., Shen, X.Y., Shi, J.P., Jia, J.Y.: Icnet for real-time semantic segmentation on high-resolution images. In: arXiv preprint arXiv:1704.08545v2. (2018)
  [36] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR. (2016) 2818–2826
  [37] Zhang, X., Cheny, Z., Wu, Q.M.J., Cai, L., Lu, D., Li, X.: Fast semantic segmentation for scene perception. IEEE TII (2019) accepted