pith. machine review for the scientific record.

arxiv: 2604.23415 · v1 · submitted 2026-04-25 · 💻 cs.CV

Recognition: unknown

A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords: heterogeneous two-stream · action recognition · optical flow · fusion strategies · cross-attention · weighted fusion · UCF11 · UCF50

The pith

Assigning different backbones to RGB and optical flow streams improves action recognition over uniform designs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using a pretrained vision transformer for RGB frames and a convolutional network trained from scratch for stacked optical flow, instead of the same backbone for both modalities. It evaluates five fusion methods after projecting features to a shared dimension and reports that cross-attention reaches 98.12 percent accuracy on the smaller UCF11 dataset compared to 95.94 percent for an RGB-only baseline. On the larger UCF50 dataset, weighted fusion achieves 96.86 percent and emerges as the most consistent option. The learned fusion weights show nearly equal contributions from both streams on UCF11 but a slight preference for RGB on UCF50. These patterns indicate that modality-specific architectures can deliver measurable gains while keeping the motion stream lightweight.

Core claim

DualStreamHybrid assigns a pretrained ViT-Tiny/16 to RGB frames and a MobileNetV2 trained from scratch to a 20-channel stacked optical flow input. A projection layer aligns the feature dimensions before one of five fusion strategies—late fusion, concatenation, cross-attention, weighted fusion, or gated fusion—is applied. On UCF11, cross-attention achieves 98.12 percent test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94 percent. On UCF50, weighted fusion reaches 96.86 percent and proves most consistent. The learned stream weights indicate near-equal modality contribution on UCF11 (RGB 0.507, flow 0.493) but greater RGB reliance on UCF50 (RGB 0.554, flow 0.446).
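To make the described pipeline concrete, here is a minimal PyTorch sketch of the heterogeneous design, assuming timm's ViT-Tiny/16 and torchvision's MobileNetV2. The 256-dimensional shared projection, the weighted-fusion head, and all module names are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a heterogeneous two-stream model in the spirit of
# DualStreamHybrid. Assumes timm and torchvision; details are illustrative.
import torch
import torch.nn as nn
import timm
from torchvision.models import mobilenet_v2


class DualStreamHybridSketch(nn.Module):
    def __init__(self, num_classes: int, proj_dim: int = 256, flow_channels: int = 20):
        super().__init__()
        # RGB stream: pretrained ViT-Tiny/16 used as a 192-d feature extractor.
        self.rgb_backbone = timm.create_model(
            "vit_tiny_patch16_224", pretrained=True, num_classes=0
        )
        # Flow stream: MobileNetV2 trained from scratch, with its 3-channel
        # input stem replaced by a 20-channel one for the stacked u,v fields.
        flow_net = mobilenet_v2(weights=None)
        flow_net.features[0][0] = nn.Conv2d(
            flow_channels, 32, kernel_size=3, stride=2, padding=1, bias=False
        )
        self.flow_backbone = flow_net.features
        self.flow_pool = nn.AdaptiveAvgPool2d(1)  # -> 1280-d vector after flatten

        # Learned projections map both streams to a shared dimensionality.
        self.rgb_proj = nn.Linear(self.rgb_backbone.num_features, proj_dim)  # 192 -> proj_dim
        self.flow_proj = nn.Linear(1280, proj_dim)

        # Weighted fusion: two learnable scalars normalised by a softmax, so the
        # trained values can be read off as e.g. RGB 0.507 / flow 0.493.
        self.stream_logits = nn.Parameter(torch.zeros(2))
        self.classifier = nn.Linear(proj_dim, num_classes)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        f_rgb = self.rgb_proj(self.rgb_backbone(rgb))                        # (B, proj_dim)
        f_flow = self.flow_proj(self.flow_pool(self.flow_backbone(flow)).flatten(1))
        w = torch.softmax(self.stream_logits, dim=0)
        fused = w[0] * f_rgb + w[1] * f_flow
        return self.classifier(fused)


# Example instantiation: one 224x224 RGB frame and one 20-channel flow stack per clip.
model = DualStreamHybridSketch(num_classes=11)
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
```

The input shapes mirror those named in the figure captions (a single 224×224 frame for the RGB stream, a 20-channel flow stack for the motion stream); the batch of two clips in the usage line is arbitrary.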

What carries the argument

DualStreamHybrid, the heterogeneous two-stream architecture that pairs a pretrained ViT-Tiny/16 for RGB with a from-scratch MobileNetV2 for stacked optical flow, joined by a learned projection layer before applying one of five fusion strategies.

Load-bearing premise

The features produced by the pretrained ViT on RGB and the from-scratch MobileNetV2 on flow are complementary enough that fusion can exploit their interaction without training differences or the projection layer creating hidden biases that explain the accuracy gains.
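For orientation, here are hedged sketches of two of the five fusion heads this premise concerns, cross-attention and gated fusion, assuming both streams have already been projected to a common dimension d. The paper's exact formulations may differ.

```python
# Illustrative fusion heads over already-projected per-clip feature vectors.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """RGB features attend to flow features: queries from RGB, keys/values from flow."""
    def __init__(self, d: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, f_rgb: torch.Tensor, f_flow: torch.Tensor) -> torch.Tensor:
        # Treat each projected vector as a length-1 token sequence.
        q, kv = f_rgb.unsqueeze(1), f_flow.unsqueeze(1)
        attended, _ = self.attn(q, kv, kv)
        return self.norm(q + attended).squeeze(1)


class GatedFusion(nn.Module):
    """A sigmoid gate computed from both streams mixes them element-wise."""
    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())

    def forward(self, f_rgb: torch.Tensor, f_flow: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([f_rgb, f_flow], dim=-1))
        return g * f_rgb + (1.0 - g) * f_flow
```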

What would settle it

Reproduce the experiments using identical backbone architectures for both streams and check whether the accuracy improvements over the RGB-only baseline disappear.

Figures

Figures reproduced from arXiv: 2604.23415 by Md. Afzalur Rahaman, Tahmid Rahman.

Figure 1. Overview of the proposed two-stream methodology. Video input is split into two preprocessing branches…
Figure 2. Detailed architecture of DualStreamHybrid. The RGB stream (top) processes a single 224×224 frame through ViT-Tiny/16, producing 196 patch embeddings plus a CLS token. The flow stream (bottom) processes a 20-channel u,v-interleaved stack through a modified MobileNetV2 (with a 20-channel input stem, trained from scratch), followed by AdaptiveAvgPool to produce a 1280-d feature vector. A projection head (Line…
Figure 3. Optical flow extraction pipeline. Each row shows a frame pair: the raw frames at times… (a minimal flow-stacking sketch follows this figure list)
Figure 4. Sample frames from UCF11 (left) and UCF50 (right), with four uniformly sampled frames shown per action…
Figure 5. Validation accuracy curves for dual-stream fusion strategies. (Left) UCF11: Cross-attention achieves the…
Figure 6. Training loss curves for dual-stream fusion strategies. (Left) UCF11: All methods converge rapidly, with…
Figure 7. Top-K accuracy comparison across all fusion strategies on UCF11 (left) and UCF50 (right). Hatching patterns allow distinction in greyscale. The legend is shared between both subplots.
Figure 8. Confusion matrices (row-normalised) for all five dual-stream fusion strategies on UCF11. Test accuracy is…
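As a companion to the Figure 3 caption, the following is a minimal sketch of assembling the 20-channel u,v-interleaved flow stack from ten consecutive frame pairs, using OpenCV's Farnebäck estimator (which the paper cites). Frame sampling, resizing, and normalisation are omitted and would be further assumptions.

```python
# Build a (2 * num_pairs, H, W) stacked optical flow input from greyscale frames.
import cv2
import numpy as np


def stacked_flow(frames: list, num_pairs: int = 10) -> np.ndarray:
    """frames: at least num_pairs + 1 greyscale images of identical size.
    Returns u,v fields interleaved along the channel axis (20 channels by default)."""
    channels = []
    for t in range(num_pairs):
        # Positional args: prev, next, flow, pyr_scale, levels, winsize,
        # iterations, poly_n, poly_sigma, flags.
        flow = cv2.calcOpticalFlowFarneback(
            frames[t], frames[t + 1], None, 0.5, 3, 15, 3, 5, 1.2, 0
        )
        channels.append(flow[..., 0])  # horizontal displacement u
        channels.append(flow[..., 1])  # vertical displacement v
    return np.stack(channels, axis=0).astype(np.float32)
```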
Original abstract

Most two-stream action recognition networks apply the same convolutional backbone to both RGB and optical flow streams, ignoring the fact that the two modalities have fundamentally different structural properties. Optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context - treating them identically discards this distinction. We propose DualStreamHybrid, a heterogeneous two-stream architecture that assigns each stream a backbone suited to its input: a pretrained ViT-Tiny/16 for RGB frames, and a MobileNetV2 trained from scratch on a 20-channel stacked optical flow representation. A learned projection layer maps the two differently-sized feature vectors to a common dimensionality before fusion, enabling the two streams to interact without forcing architectural symmetry. We design five fusion strategies within a unified framework - late fusion, concatenation, cross-attention, weighted fusion, and gated fusion - and evaluate them on UCF11 (1,600 videos, 11 classes) and UCF50 (6,681 videos, 50 classes) to study how fusion behaviour scales with dataset size. On UCF11, cross-attention achieves 98.12% test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94%, which suggests that explicit inter-modal attention is particularly effective on smaller, less complex datasets. On UCF50, weighted fusion reaches 96.86% and proves the most consistent strategy across both benchmarks. The learned stream weights reveal an interesting pattern: UCF11 sees near-equal modality contribution (RGB: 0.507, flow: 0.493), while UCF50 favours the RGB stream slightly more (RGB: 0.554, flow: 0.446) - arguably reflecting the larger and more visually diverse action space. Taken together, these results suggest that even a lightweight motion stream meaningfully complements a strong appearance encoder, and that the optimal fusion strategy depends on dataset scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes DualStreamHybrid, a heterogeneous two-stream architecture for video action recognition that pairs a pretrained ViT-Tiny/16 backbone with RGB frames and a MobileNetV2 backbone (trained from scratch) with 20-channel stacked optical flow. Five fusion strategies (late fusion, concatenation, cross-attention, weighted fusion, gated fusion) are evaluated after a learned projection layer aligns feature dimensions; on UCF11 cross-attention reaches 98.12% (vs. 95.94% RGB-only ViT-Tiny) while on UCF50 weighted fusion reaches 96.86%, with learned stream weights indicating near-equal contribution on UCF11 and slight RGB preference on UCF50.

Significance. If the reported gains are reproducible and the missing UCF50 baselines confirm complementarity, the work would provide concrete evidence that modality-specific backbones outperform symmetric two-stream designs and that optimal fusion varies with dataset scale, offering a practical template for efficient, heterogeneous video models.

major comments (2)
  1. [Abstract] Abstract: the 96.86% weighted-fusion accuracy on UCF50 is stated without any RGB-only ViT-Tiny or flow-only baseline under the same protocol, unlike the explicit 2.18 pp delta provided for UCF11 (98.12% vs 95.94%). Without this comparison the claim that the motion stream 'meaningfully complements' the RGB encoder and that the learned weights (RGB 0.554, flow 0.446) reflect differential contribution cannot be verified.
  2. [Abstract] Abstract: no training protocol (optimizer, schedule, epochs, augmentation), no standard deviations over runs, and no ablation of the projection layer or homogeneous two-stream controls are supplied. These omissions make it impossible to determine whether the accuracy differences arise from the heterogeneous design or from uncontrolled training factors.
minor comments (1)
  1. [Abstract] The abstract lists the five fusion strategies but does not indicate which table or figure reports the full per-strategy accuracies and weights for both datasets; adding an explicit reference would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We have revised the manuscript to incorporate the missing baselines for UCF50 and to provide additional experimental details and controls. Our point-by-point responses follow.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the 96.86% weighted-fusion accuracy on UCF50 is stated without any RGB-only ViT-Tiny or flow-only baseline under the same protocol, unlike the explicit 2.18 pp delta provided for UCF11 (98.12% vs 95.94%). Without this comparison the claim that the motion stream 'meaningfully complements' the RGB encoder and that the learned weights (RGB 0.554, flow 0.446) reflect differential contribution cannot be verified.

    Authors: We agree that the abstract should include the UCF50 baselines to make the complementarity claim and the interpretation of the learned weights directly verifiable. We have revised the abstract to report the RGB-only ViT-Tiny and flow-only accuracies on UCF50 under the identical protocol, confirming a meaningful gain from fusion. The learned weights are parameters of the weighted fusion module and are now contextualized with these baselines in the updated abstract; full per-stream results remain in the experimental section. revision: yes

  2. Referee: [Abstract] Abstract: no training protocol (optimizer, schedule, epochs, augmentation), no standard deviations over runs, and no ablation of the projection layer or homogeneous two-stream controls are supplied. These omissions make it impossible to determine whether the accuracy differences arise from the heterogeneous design or from uncontrolled training factors.

    Authors: We acknowledge that the abstract omitted these elements. The training protocol (optimizer, schedule, epochs, and augmentations) is fully specified in Section 4.1 of the manuscript; we have added a concise summary to the revised abstract. To address reproducibility and isolate the effect of the heterogeneous design, we have performed additional runs to report mean and standard deviation, added an ablation of the projection layer, and included homogeneous two-stream controls (ViT-Tiny on both streams and MobileNetV2 on both streams). These new results, now in the experimental section and summarized in the abstract, demonstrate that the observed gains arise from the modality-specific backbones rather than training variations. revision: yes

Circularity Check

0 steps flagged

No circularity: all claims are empirical test accuracies from trained models

full rationale

The paper presents an empirical architecture (DualStreamHybrid with heterogeneous backbones and five fusion strategies) evaluated via measured test-set accuracies on UCF11 and UCF50. Key results such as the 98.12% cross-attention accuracy and the learned stream weights (0.507/0.493) are post-training observations, not quantities that reduce to the inputs by construction through some equation. No self-citations, uniqueness theorems, or ansatzes appear in the provided text as load-bearing steps. The derivation chain consists of direct experimental reporting and is therefore self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The claim rests on standard supervised training of off-the-shelf vision backbones plus a small projection layer; no new physical entities are postulated and the only free parameters are the usual learned weights plus the five fusion operators.

free parameters (2)
  • learned stream weights = RGB ~0.53, flow ~0.47 (dataset-dependent)
    Scalar weights (RGB 0.507/0.554, flow 0.493/0.446) fitted during training to balance modalities in the weighted-fusion variant; a read-out sketch follows this ledger.
  • projection layer weights
    Linear mapping that aligns ViT and MobileNet feature dimensions before fusion.
axioms (2)
  • domain assumption A pretrained ViT-Tiny/16 extracts useful appearance features from RGB frames
    Invoked when the paper assigns the pretrained ViT to the RGB stream without further justification.
  • domain assumption MobileNetV2 trained from scratch on stacked optical flow captures motion patterns effectively
    Invoked when the paper selects and trains MobileNetV2 specifically for the flow stream.
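Purely illustrative: how the weighted-fusion free parameters in the ledger could be read out of a trained model, assuming the two stream weights are stored as logits normalised by a softmax as in the earlier sketch. The logit values below are hypothetical, chosen only to reproduce the reported UCF11 split.

```python
# Hypothetical read-out of the two learned stream weights after training.
import torch

trained_logits = torch.tensor([0.028, 0.0])  # hypothetical values, not from the paper
w_rgb, w_flow = torch.softmax(trained_logits, dim=0).tolist()
print(f"RGB weight {w_rgb:.3f}, flow weight {w_flow:.3f}")  # ~0.507 / ~0.493
```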

pith-pipeline@v0.9.0 · 5653 in / 1645 out tokens · 64134 ms · 2026-05-08T08:25:21.917025+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 30 canonical work pages · 1 internal anchor

[1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys, 43(3):16:1–16:43, 2011. doi:10.1145/1922649.1922653
[3] H. Alamri, V. Cartillier, A. Das, J. Wang, A. Cherian, I. Essa, D. Batra, T. K. Marks, C. Hori, P. Anderson, S. Lee, and D. Parikh. Audio visual scene-aware dialog. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. doi:10.1109/CVPR.2019.00774
[4] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid. ViViT: A video vision transformer. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836–6846, 2021. doi:10.1109/ICCV48922.2021.00676
[5] M. Baladaniya and A. K. Choudhary. Artificial intelligence in sports science: A systematic review on performance optimization, injury prevention, and rehabilitation. Journal of Clinical Medicine of Kazakhstan, 22(3):64–72. doi:10.23950/jcmk/16412
[7] G. Bertasius, H. Wang, and L. Torresani. Is space-time attention all you need for video understanding? In International Conference on Machine Learning (ICML), volume 139, pages 813–824, 2021.
[8] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6299–6308, 2017. doi:10.1109/CVPR.2017.502
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009. doi:10.1109/CVPR.2009.5206848
[10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
[11] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, and C. Feichtenhofer. Multiscale vision transformers. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 6824–6835, 2021. doi:10.1109/ICCV48922.2021.00675
[12] G. Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scandinavian Conference on Image Analysis (SCIA), pages 363–370, 2003. doi:10.1007/3-540-45103-X_50
[13] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1933–1941, 2016. doi:10.1109/CVPR.2016.213
[15] K. Gadzicki, R. Knappe, and C. Zetzsche. Early vs late fusion in multimodal convolutional neural networks. In IEEE International Conference on Information Fusion (FUSION), pages 1–7, 2020. doi:10.23919/FUSION45008.2020.9190246
[16] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes. Two stream LSTM: A deep fusion framework for human action recognition. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 177–186, 2017. doi:10.1109/WACV.2017.27
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi:10.1109/CVPR.2016.90
[19] A. Hussain, K. Muhammad, A. Ullah, Z. Ahmad, S. W. Baik, and V. H. C. de Albuquerque. Vision transformer and deep sequence learning for human activity recognition in surveillance videos. Computational Intelligence and Neuroscience, 2022:3454167, 2022. doi:10.1155/2022/3454167
[20] M. Ilse, J. M. Tomczak, and M. Welling. Attention-based deep multiple instance learning. In International Conference on Machine Learning (ICML), volume 80, pages 2132–2141, 2018.
[21] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1725–1732, 2014. doi:10.1109/CVPR.2014.223
[22] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[23] M. Kumar, A. K. Patel, M. Biswas, and S. Shitharth. Attention-based bidirectional-long short-term memory for abnormal human activity detection. Scientific Reports, 13(1):14442, 2023. doi:10.1038/s41598-023-41231-0
[24] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos "in the wild". In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1996–2003, 2009. doi:10.1109/CVPR.2009.5206744
[25] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021. doi:10.1109/ICCV48922.2021.00986
[26] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu. Video Swin transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3202–3211, 2022. doi:10.1109/CVPR52688.2022.00320
[27] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.
[28] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010. doi:10.1016/j.imavis.2009.11.014
[30] K. K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5):971–981, 2013. doi:10.1007/s00138-012-0450-4
[31] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018. doi:10.1109/CVPR.2018.00474
[32] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NeurIPS), volume 27, pages 568–576, 2014. doi:10.48550/arXiv.1406.2199
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015. doi:10.48550/arXiv.1409.1556
[34] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. Technical Report CRCV-TR-12-01, 2012.
[35] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8934–8943, 2018. doi:10.1109/CVPR.2018.00931
[36] Z. Teed and J. Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision (ECCV), pages 402–419, 2020. doi:10.1007/978-3-030-58536-5_24
[37] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015. doi:10.1109/ICCV.2015.510
[38] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), pages 20–36, 2016. doi:10.1007/978-3-319-46484-8_2
[40] J. Ye, L. Wang, G. Li, D. Chen, S. Zhe, X. Chu, and Z. Xu. Learning compact recurrent neural networks with block-term tensor decomposition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9378–9387, 2018. doi:10.1109/CVPR.2018.00977
[41] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In DAGM Symposium on Pattern Recognition, pages 214–223, 2007. doi:10.1007/978-3-540-74936-3_22