Separable Convolutional LSTMs for Faster Video Segmentation
Pith reviewed 2026-05-24 21:16 UTC · model grok-4.3
The pith
ConvLSTM cells modified with separable convolutions enable faster video segmentation with comparable accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By generalizing spatial and depthwise separable convolutions to convLSTM cells, the number of parameters and required FLOPs are reduced significantly. Segmentation approaches using these modified cells achieve similar or slightly worse accuracy but are up to 15 percent faster on a GPU compared to standard convLSTM versions. A new evaluation metric measures flickering pixels in segmented video sequences.
What carries the argument
The modified convLSTM cells, in which spatial and depthwise separable convolutions replace the standard input-to-state and state-to-state convolutions inside the gates.
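To make the parameter savings concrete, the sketch below counts convolution weights for a convLSTM cell under three variants. It is a back-of-the-envelope illustration, not the paper's implementation. The function names, the choice of four gates with one input-to-state and one state-to-state convolution each, the omission of biases, and the specific spatial factorization (a k x 1 convolution followed by a 1 x k convolution) are all assumptions made here.

```python
# Hypothetical weight-count comparison for convLSTM variants (biases ignored).

def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


def spatial_separable_params(k, c_in, c_out):
    """Weights of a k x 1 conv (c_in -> c_out) followed by a 1 x k conv (c_out -> c_out).

    The intermediate channel count is assumed equal to c_out.
    """
    return k * c_in * c_out + k * c_out * c_out


def convlstm_params(per_conv, k, c_in, c_hidden, gates=4):
    """Total weights when every gate has an input-to-state and a state-to-state conv."""
    per_gate = per_conv(k, c_in, c_hidden) + per_conv(k, c_hidden, c_hidden)
    return gates * per_gate


# Example sizes chosen for illustration only.
k, c_in, c_hidden = 3, 64, 64
for name, fn in [
    ("standard", conv_params),
    ("depthwise separable", depthwise_separable_params),
    ("spatial separable", spatial_separable_params),
]:
    print(name, convlstm_params(fn, k, c_in, c_hidden))
```

With these illustrative sizes the depthwise separable variant needs roughly an eighth of the standard cell's weights. Whether that translates into a matching runtime gain depends on the hardware and framework, which is consistent with the much smaller speed-up of up to 15 percent reported above.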
If this is right
- Video segmentation networks achieve similar performance with reduced computational complexity.
- Per-frame inference is up to 15 percent faster on GPU hardware than with standard convLSTM cells.
- The new flickering metric provides a quantitative way to evaluate temporal consistency in segmentations.
- The approach maintains the core benefit of temporal modeling while lowering resource demands.
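This page does not reproduce the paper's definition of the flickering metric, so the following is only one plausible reading, not the authors' formula. It assumes a pixel "flickers" when its predicted label changes between consecutive frames, and that, when ground truth is available, changes coinciding with a ground-truth change are excluded. The function name `flicker_fraction` and the nested-list frame format are hypothetical.

```python
# Hypothetical flicker measure: fraction of pixel transitions whose predicted
# label changes between consecutive frames. This is an assumed definition.

def flicker_fraction(preds, gts=None):
    """preds, gts: lists of frames; each frame is a list of rows of int labels."""
    flickers = 0
    considered = 0
    for t in range(1, len(preds)):
        for y, row in enumerate(preds[t]):
            for x, label in enumerate(row):
                # Skip pixels whose ground-truth label changed: a prediction
                # change there may be legitimate rather than flicker.
                if gts is not None and gts[t][y][x] != gts[t - 1][y][x]:
                    continue
                considered += 1
                if label != preds[t - 1][y][x]:
                    flickers += 1
    return flickers / considered if considered else 0.0


# Tiny 2 x 2 example: one pixel toggles 0 -> 1 -> 0 across three frames.
preds = [
    [[0, 0], [1, 1]],
    [[0, 1], [1, 1]],
    [[0, 0], [1, 1]],
]
print(flicker_fraction(preds))
```

A lower value would indicate a temporally more stable segmentation under this assumed definition. Comparing the value for standard and separable convLSTM variants on the same sequence is the kind of use the bullet above has in mind.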
Where Pith is reading between the lines
- The separable modification technique might extend to other recurrent units in video processing pipelines.
- Speed gains could support real-time operation on embedded hardware for robotics applications.
- The flickering metric might serve as a complementary benchmark for any temporal segmentation method.
Load-bearing premise
That the separable convolution replacements in convLSTM cells do not substantially impair the recurrent temporal modeling essential for video segmentation performance.
What would settle it
A direct comparison showing accuracy losses that are more than slight, or speed gains that vanish on different hardware, would challenge the central claim.
Original abstract
Semantic Segmentation is an important module for autonomous robots such as self-driving cars. The advantage of video segmentation approaches compared to single image segmentation is that temporal image information is considered, and their performance increases due to this. Hence, single image segmentation approaches are extended by recurrent units such as convolutional LSTM (convLSTM) cells, which are placed at suitable positions in the basic network architecture. However, a major critique of video segmentation approaches based on recurrent neural networks is their large parameter count and their computational complexity, and so, their inference time of one video frame takes up to 66 percent longer than their basic version. Inspired by the success of the spatial and depthwise separable convolutional neural networks, we generalize these techniques for convLSTMs in this work, so that the number of parameters and the required FLOPs are reduced significantly. Experiments on different datasets show that the segmentation approaches using the proposed, modified convLSTM cells achieve similar or slightly worse accuracy, but are up to 15 percent faster on a GPU than the ones using the standard convLSTM cells. Furthermore, a new evaluation metric is introduced, which measures the amount of flickering pixels in the segmented video sequence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes to replace the standard convolutions inside convLSTM gates (both input-to-state and state-to-state) with spatial and depthwise separable convolutions, thereby reducing parameter count and FLOPs while preserving the overall video-segmentation architecture. Experiments are reported to show that the resulting models achieve accuracy comparable to (or only slightly below) unmodified convLSTM baselines while delivering up to 15 % GPU speed-up; a new “flickering-pixel” metric is also introduced to quantify temporal instability.
Significance. If the empirical parity claim holds under rigorous controls, the work supplies a practical, drop-in acceleration technique for recurrent video segmentation that could be directly useful for real-time robotics and autonomous-driving pipelines. The new flickering metric is a modest but welcome addition to the evaluation toolkit.
major comments (2)
- [Abstract and Experiments section] The central empirical claim (comparable accuracy with speed gain) rests on experiments whose description supplies neither dataset identities, baseline architectures, number of runs, error bars, nor statistical tests. Without these controls it is impossible to determine whether the reported parity is attributable to the separable convLSTM modification or to the backbone network.
- [§3] §3 (proposed separable convLSTM cell): the manuscript provides no analysis or ablation demonstrating that depthwise separable factorization inside the four gates preserves the temporal state propagation that justifies the use of convLSTMs. If cross-channel mixing is materially reduced, any observed accuracy parity could be an artifact of the spatial backbone rather than evidence that the recurrent component remains functional.
minor comments (2)
- [Abstract] The abstract states “up to 15 percent faster” without specifying the exact hardware, batch size, or input resolution used for the timing measurements.
- [§3] Notation for the separable convolution operators inside the LSTM gates is introduced without an explicit equation relating the factorized kernels to the original full convolution.
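As an illustration of the kind of explicit relation the last minor comment asks for, the block below writes out how the two factorizations relate to a full convolution kernel. The notation is assumed here and is not taken from the manuscript: $W \in \mathbb{R}^{k \times k \times C_{in} \times C_{out}}$ is a full kernel, $D$ and $P$ are depthwise and pointwise kernels, $A$ and $B$ are the $k \times 1$ and $1 \times k$ kernels with $C_{mid}$ intermediate channels, and the input gate is shown in the common convLSTM form without peephole terms.

```latex
\begin{align}
  % Input gate of a standard convLSTM (peephole terms omitted)
  i_t &= \sigma\!\left( W_{xi} * X_t + W_{hi} * H_{t-1} + b_i \right) \\
  % Depthwise separable: the full kernel is constrained to a per-channel product
  W[u, v, c, o] &= D[u, v, c] \, P[c, o],
    & \#\text{weights} &= k^2 C_{in} + C_{in} C_{out} \\
  % Spatial separable: a k x 1 kernel A followed by a 1 x k kernel B
  W[u, v, c, o] &= \sum_{m=1}^{C_{mid}} A[u, c, m] \, B[v, m, o],
    & \#\text{weights} &= k \, C_{in} C_{mid} + k \, C_{mid} C_{out} \\
  % Weight ratio of depthwise separable to full convolution
  \frac{k^2 C_{in} + C_{in} C_{out}}{k^2 C_{in} C_{out}}
    &= \frac{1}{C_{out}} + \frac{1}{k^2}
\end{align}
```

Each $W_{x\cdot}$ and $W_{h\cdot}$ in the gates would be replaced by one of these constrained forms. The first factorization mixes channels only in the pointwise step, which is the point at which the referee's concern about reduced cross-channel mixing applies.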
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the manuscript to improve experimental reporting and add supporting analysis.
- Referee: [Abstract and Experiments section] The central empirical claim (comparable accuracy with speed gain) rests on experiments whose description supplies neither dataset identities, baseline architectures, number of runs, error bars, nor statistical tests. Without these controls it is impossible to determine whether the reported parity is attributable to the separable convLSTM modification or to the backbone network.
  Authors: We agree that the experimental description requires greater explicitness for reproducibility. In the revised manuscript we will explicitly list the dataset identities, baseline architectures, number of runs, error bars, and any statistical tests performed. Because the backbone network is held identical between the standard convLSTM and separable-convLSTM variants, with the sole change being the factorization inside the convLSTM gates, the speed-up and accuracy results can be attributed to the proposed modification.
  Revision: yes
- Referee: [§3] §3 (proposed separable convLSTM cell): the manuscript provides no analysis or ablation demonstrating that depthwise separable factorization inside the four gates preserves the temporal state propagation that justifies the use of convLSTMs. If cross-channel mixing is materially reduced, any observed accuracy parity could be an artifact of the spatial backbone rather than evidence that the recurrent component remains functional.
  Authors: We acknowledge that an explicit ablation would strengthen the claim that the recurrent dynamics are preserved. While the gate structure, recurrent connections, and overall architecture remain unchanged, we will add an ablation study in the revision that examines the effect of the factorization on temporal state propagation (for example, by comparing hidden-state evolution metrics across variants). This will help confirm that the recurrent functionality is retained rather than being an artifact of the backbone.
  Revision: yes
Circularity Check
No circularity; empirical modification tested on external datasets
The paper proposes applying spatial and depthwise separable convolutions to the gates of convLSTM cells as an engineering modification, then reports GPU runtime and accuracy on standard video segmentation datasets. No equations, fitted parameters, or self-citations are used to derive the performance claims; the reported speedups and accuracy parity are direct empirical measurements against unmodified baselines. The central claim therefore does not reduce to any input quantity by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Convolutional LSTM cells can be modified with separable convolutions while retaining sufficient temporal modeling power for video segmentation.