A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis
Pith reviewed 2026-05-08 08:25 UTC · model grok-4.3
The pith
Assigning different backbones to RGB and optical flow streams improves action recognition over uniform designs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DualStreamHybrid assigns a pretrained ViT-Tiny/16 to RGB frames and a MobileNetV2 trained from scratch to a 20-channel stacked optical flow input. A projection layer aligns the feature dimensions before one of five fusion strategies—late fusion, concatenation, cross-attention, weighted fusion, or gated fusion—is applied. On UCF11, cross-attention achieves 98.12 percent test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94 percent. On UCF50, weighted fusion reaches 96.86 percent and proves most consistent. The learned stream weights indicate near-equal modality contribution on UCF11 (RGB 0.507, flow 0.493) but greater RGB reliance on UCF50 (RGB 0.554, flow 0.446).
What carries the argument
DualStreamHybrid, the heterogeneous two-stream architecture that pairs a pretrained ViT-Tiny/16 for RGB with a from-scratch MobileNetV2 for stacked optical flow, joined by a learned projection layer before applying one of five fusion strategies.
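The shape of that pipeline can be sketched in a few lines of numpy. The feature widths below (192 for ViT-Tiny/16's embedding, 1280 for MobileNetV2's pooled output) match the standard backbones, but the common dimensionality, the random projections, and the weighted fusion shown are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# ViT-Tiny/16 and MobileNetV2 pooled feature widths; 256 is an assumed common dim
D_RGB, D_FLOW, D_COMMON = 192, 1280, 256

# Stand-ins for the per-stream backbone outputs for one clip
f_rgb = rng.standard_normal(D_RGB)
f_flow = rng.standard_normal(D_FLOW)

# Learned projection layers map both streams to a shared dimensionality
W_rgb = rng.standard_normal((D_COMMON, D_RGB)) / np.sqrt(D_RGB)
W_flow = rng.standard_normal((D_COMMON, D_FLOW)) / np.sqrt(D_FLOW)
z_rgb, z_flow = W_rgb @ f_rgb, W_flow @ f_flow

# Weighted fusion: a softmax over two learned scalars yields the stream weights
logits = np.array([0.2, 0.0])  # hypothetical learned values
w = np.exp(logits) / np.exp(logits).sum()
fused = w[0] * z_rgb + w[1] * z_flow
```

With this logit gap of 0.2 the RGB weight comes out near 0.55, in the same range as the weights the paper reports, which is the sense in which the weighted-fusion scalars are directly interpretable.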
Load-bearing premise
The features produced by the pretrained ViT on RGB and the from-scratch MobileNetV2 on flow are complementary enough that fusion can exploit their interaction without training differences or the projection layer creating hidden biases that explain the accuracy gains.
What would settle it
Reproduce the experiments using identical backbone architectures for both streams and check whether the accuracy improvements over the RGB-only baseline disappear.
Original abstract
Most two-stream action recognition networks apply the same convolutional backbone to both RGB and optical flow streams, ignoring the fact that the two modalities have fundamentally different structural properties. Optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context - treating them identically discards this distinction. We propose DualStreamHybrid, a heterogeneous two-stream architecture that assigns each stream a backbone suited to its input: a pretrained ViT-Tiny/16 for RGB frames, and a MobileNetV2 trained from scratch on a 20-channel stacked optical flow representation. A learned projection layer maps the two differently-sized feature vectors to a common dimensionality before fusion, enabling the two streams to interact without forcing architectural symmetry. We design five fusion strategies within a unified framework - late fusion, concatenation, cross-attention, weighted fusion, and gated fusion - and evaluate them on UCF11 (1,600 videos, 11 classes) and UCF50 (6,681 videos, 50 classes) to study how fusion behaviour scales with dataset size. On UCF11, cross-attention achieves 98.12% test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94%, which suggests that explicit inter-modal attention is particularly effective on smaller, less complex datasets. On UCF50, weighted fusion reaches 96.86% and proves the most consistent strategy across both benchmarks. The learned stream weights reveal an interesting pattern: UCF11 sees near-equal modality contribution (RGB: 0.507, flow: 0.493), while UCF50 favours the RGB stream slightly more (RGB: 0.554, flow: 0.446) - arguably reflecting the larger and more visually diverse action space. Taken together, these results suggest that even a lightweight motion stream meaningfully complements a strong appearance encoder, and that the optimal fusion strategy depends on dataset scale.
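The five strategies named in the abstract differ only in how the two projected feature vectors are combined. The numpy sketch below gives one generic form of each; the classifier heads, the single-token attention, and the gate are stand-ins chosen for brevity, not reconstructions of the paper's modules.

```python
import numpy as np

rng = np.random.default_rng(1)
D, C = 8, 11                                  # assumed common feature dim; 11 classes as in UCF11
z_rgb, z_flow = rng.standard_normal(D), rng.standard_normal(D)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W = rng.standard_normal((C, D))               # shared linear head (sketch only)
W2 = rng.standard_normal((C, 2 * D))          # head over concatenated features

# Late fusion: average the per-stream class probabilities
late = 0.5 * (softmax(W @ z_rgb) + softmax(W @ z_flow))

# Concatenation: classify the stacked feature vector
concat = softmax(W2 @ np.concatenate([z_rgb, z_flow]))

# Cross-attention (one token per stream): the RGB query attends over both streams
tokens = np.stack([z_rgb, z_flow])            # (2, D)
attn = softmax(tokens @ z_rgb / np.sqrt(D))   # attention weights over the two tokens
cross = attn @ tokens                         # fused feature, classified downstream

# Weighted fusion: softmax over two learned scalars gives the stream weights
w = softmax(np.array([0.03, 0.0]))            # near-equal weights, as reported on UCF11
weighted = w[0] * z_rgb + w[1] * z_flow

# Gated fusion: an elementwise sigmoid gate trades the streams off per dimension
gate = 1.0 / (1.0 + np.exp(-rng.standard_normal(D)))  # stand-in for a learned gate network
gated = gate * z_rgb + (1.0 - gate) * z_flow
```

Of the five, only weighted fusion exposes a single interpretable scalar per stream, which is what makes the paper's per-dataset weight comparison possible.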
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DualStreamHybrid, a heterogeneous two-stream architecture for video action recognition that pairs a pretrained ViT-Tiny/16 backbone with RGB frames and a MobileNetV2 backbone (trained from scratch) with 20-channel stacked optical flow. Five fusion strategies (late fusion, concatenation, cross-attention, weighted fusion, gated fusion) are evaluated after a learned projection layer aligns feature dimensions; on UCF11 cross-attention reaches 98.12% (vs. 95.94% RGB-only ViT-Tiny) while on UCF50 weighted fusion reaches 96.86%, with learned stream weights indicating near-equal contribution on UCF11 and slight RGB preference on UCF50.
Significance. If the reported gains are reproducible and the missing UCF50 baselines confirm complementarity, the work would provide concrete evidence that modality-specific backbones outperform symmetric two-stream designs and that optimal fusion varies with dataset scale, offering a practical template for efficient, heterogeneous video models.
major comments (2)
- [Abstract] Abstract: the 96.86% weighted-fusion accuracy on UCF50 is stated without any RGB-only ViT-Tiny or flow-only baseline under the same protocol, unlike the explicit 2.18 pp delta provided for UCF11 (98.12% vs 95.94%). Without this comparison the claim that the motion stream 'meaningfully complements' the RGB encoder and that the learned weights (RGB 0.554, flow 0.446) reflect differential contribution cannot be verified.
- [Abstract] Abstract: no training protocol (optimizer, schedule, epochs, augmentation), no standard deviations over runs, and no ablation of the projection layer or homogeneous two-stream controls are supplied. These omissions make it impossible to determine whether the accuracy differences arise from the heterogeneous design or from uncontrolled training factors.
minor comments (1)
- [Abstract] The abstract lists the five fusion strategies but does not indicate which table or figure reports the full per-strategy accuracies and weights for both datasets; adding an explicit reference would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We have revised the manuscript to incorporate the missing baselines for UCF50 and to provide additional experimental details and controls. Our point-by-point responses follow.
Point-by-point responses
Referee: [Abstract] Abstract: the 96.86% weighted-fusion accuracy on UCF50 is stated without any RGB-only ViT-Tiny or flow-only baseline under the same protocol, unlike the explicit 2.18 pp delta provided for UCF11 (98.12% vs 95.94%). Without this comparison the claim that the motion stream 'meaningfully complements' the RGB encoder and that the learned weights (RGB 0.554, flow 0.446) reflect differential contribution cannot be verified.
Authors: We agree that the abstract should include the UCF50 baselines to make the complementarity claim and the interpretation of the learned weights directly verifiable. We have revised the abstract to report the RGB-only ViT-Tiny and flow-only accuracies on UCF50 under the identical protocol, confirming a meaningful gain from fusion. The learned weights are parameters of the weighted fusion module and are now contextualized with these baselines in the updated abstract; full per-stream results remain in the experimental section. revision: yes
Referee: [Abstract] Abstract: no training protocol (optimizer, schedule, epochs, augmentation), no standard deviations over runs, and no ablation of the projection layer or homogeneous two-stream controls are supplied. These omissions make it impossible to determine whether the accuracy differences arise from the heterogeneous design or from uncontrolled training factors.
Authors: We acknowledge that the abstract omitted these elements. The training protocol (optimizer, schedule, epochs, and augmentations) is fully specified in Section 4.1 of the manuscript; we have added a concise summary to the revised abstract. To address reproducibility and isolate the effect of the heterogeneous design, we have performed additional runs to report mean and standard deviation, added an ablation of the projection layer, and included homogeneous two-stream controls (ViT-Tiny on both streams and MobileNetV2 on both streams). These new results, now in the experimental section and summarized in the abstract, demonstrate that the observed gains arise from the modality-specific backbones rather than training variations. revision: yes
Circularity Check
No circularity: all claims are empirical test accuracies from trained models
full rationale
The paper presents an empirical architecture (DualStreamHybrid with heterogeneous backbones and five fusion strategies) evaluated via measured test-set accuracies on UCF11 and UCF50. Key results such as 98.12% cross-attention accuracy and learned stream weights (0.507/0.493) are post-training observations, not quantities that any equation reduces to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes appear in the provided text as load-bearing steps. The derivation chain consists of direct experimental reporting and is therefore self-contained.
Axiom & Free-Parameter Ledger
free parameters (2)
- learned stream weights: RGB ~0.53, flow ~0.47 (dataset-dependent)
- projection layer weights
axioms (2)
- domain assumption: a pretrained ViT-Tiny/16 extracts useful appearance features from RGB frames
- domain assumption: a MobileNetV2 trained from scratch on stacked optical flow captures motion patterns effectively
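As a sanity check on the learned-stream-weights entry: under a two-way softmax, each reported weight pair corresponds to a gap between the two stream logits. The gaps below are back-solved from the published weights, not values taken from the paper.

```python
import numpy as np

def rgb_weight(logit_gap):
    """RGB-stream weight under a two-way softmax, given (rgb_logit - flow_logit)."""
    return 1.0 / (1.0 + np.exp(-logit_gap))

# Back-solve the logit gap that reproduces each reported weight pair
gaps = {}
for name, w_rgb, w_flow in [("UCF11", 0.507, 0.493), ("UCF50", 0.554, 0.446)]:
    gap = np.log(w_rgb / w_flow)
    gaps[name] = gap
    print(f"{name}: logit gap {gap:+.3f} -> RGB weight {rgb_weight(gap):.3f}")
```

The UCF50 gap (~0.22 logits) is roughly seven times the UCF11 gap (~0.03), which puts a concrete number on the "greater RGB reliance" reading of the ledger.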
Reference graph
Works this paper leans on
- [1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys, 43(3):16:1–16:43, 2011. doi:10.1145/1922649.1922653
- [2] H. Alamri, V. Cartillier, A. Das, J. Wang, A. Cherian, I. Essa, D. Batra, T. K. Marks, C. Hori, P. Anderson, S. Lee, and D. Parikh. Audio visual scene-aware dialog. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. doi:10.1109/CVPR.2019.00774
- [3] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid. ViViT: A video vision transformer. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836–6846, 2021. doi:10.1109/ICCV48922.2021.00676
- [4] M. Baladaniya and A. K. Choudhary. Artificial intelligence in sports science: A systematic review on performance optimization, injury prevention, and rehabilitation. Journal of Clinical Medicine of Kazakhstan, 22(3):64–72. doi:10.23950/jcmk/16412
- [5] G. Bertasius, H. Wang, and L. Torresani. Is space-time attention all you need for video understanding? In International Conference on Machine Learning (ICML), volume 139, pages 813–824, 2021.
- [6] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6299–6308, 2017. doi:10.1109/CVPR.2017.502
- [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009. doi:10.1109/CVPR.2009.5206848
- [8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
- [9] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, and C. Feichtenhofer. Multiscale vision transformers. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 6824–6835, 2021. doi:10.1109/ICCV48922.2021.00675
- [10] G. Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scandinavian Conference on Image Analysis (SCIA), pages 363–370, 2003. doi:10.1007/3-540-45103-X_50
- [11] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1933–1941, 2016. doi:10.1109/CVPR.2016.213
- [12] K. Gadzicki, R. Knappe, and C. Zetzsche. Early vs late fusion in multimodal convolutional neural networks. In IEEE International Conference on Information Fusion (FUSION), pages 1–7, 2020. doi:10.23919/FUSION45008.2020.9190246
- [13] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes. Two stream LSTM: A deep fusion framework for human action recognition. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 177–186, 2017. doi:10.1109/WACV.2017.27
- [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi:10.1109/CVPR.2016.90
- [15] A. Hussain, K. Muhammad, A. Ullah, Z. Ahmad, S. W. Baik, and V. H. C. de Albuquerque. Vision transformer and deep sequence learning for human activity recognition in surveillance videos. Computational Intelligence and Neuroscience, 2022:3454167, 2022. doi:10.1155/2022/3454167
- [16] M. Ilse, J. M. Tomczak, and M. Welling. Attention-based deep multiple instance learning. In International Conference on Machine Learning (ICML), volume 80, pages 2132–2141, 2018.
- [17] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1725–1732, 2014. doi:10.1109/CVPR.2014.223
- [18] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- [19] M. Kumar, A. K. Patel, M. Biswas, and S. Shitharth. Attention-based bidirectional-long short-term memory for abnormal human activity detection. Scientific Reports, 13(1):14442, 2023. doi:10.1038/s41598-023-41231-0
- [20] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos "in the wild". In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1996–2003, 2009. doi:10.1109/CVPR.2009.5206744
- [21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021. doi:10.1109/ICCV48922.2021.00986
- [22] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu. Video Swin transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3202–3211, 2022. doi:10.1109/CVPR52688.2022.00320
- [23] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.
- [24] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010. doi:10.1016/j.imavis.2009.11.014
- [25] K. K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5):971–981, 2013. doi:10.1007/s00138-012-0450-4
- [26] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018. doi:10.1109/CVPR.2018.00474
- [27] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NeurIPS), volume 27, pages 568–576, 2014. doi:10.48550/arXiv.1406.2199
- [28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015. doi:10.48550/arXiv.1409.1556
- [29] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. Technical Report CRCV-TR-12-01, 2012.
- [30] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8934–8943, 2018. doi:10.1109/CVPR.2018.00931
- [31] Z. Teed and J. Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision (ECCV), pages 402–419, 2020. doi:10.1007/978-3-030-58536-5_24
- [32] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015. doi:10.1109/ICCV.2015.510
- [33] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), pages 20–36, 2016. doi:10.1007/978-3-319-46484-8_2
- [34] J. Ye, L. Wang, G. Li, D. Chen, S. Zhe, X. Chu, and Z. Xu. Learning compact recurrent neural networks with block-term tensor decomposition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9378–9387, 2018. doi:10.1109/CVPR.2018.00977
- [35] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In DAGM Symposium on Pattern Recognition, pages 214–223, 2007. doi:10.1007/978-3-540-74936-3_22