A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis
Pith reviewed 2026-05-08 08:25 UTC · model grok-4.3
The pith
Assigning different backbones to RGB and optical flow streams improves action recognition over uniform designs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DualStreamHybrid assigns a pretrained ViT-Tiny/16 to RGB frames and a MobileNetV2 trained from scratch to a 20-channel stacked optical flow input. A projection layer aligns the feature dimensions before one of five fusion strategies—late fusion, concatenation, cross-attention, weighted fusion, or gated fusion—is applied. On UCF11, cross-attention achieves 98.12 percent test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94 percent. On UCF50, weighted fusion reaches 96.86 percent and proves most consistent. The learned stream weights indicate near-equal modality contribution on UCF11 (RGB 0.507, flow 0.493) but greater RGB reliance on UCF50 (RGB 0.554, flow 0.446).
What carries the argument
DualStreamHybrid, the heterogeneous two-stream architecture that pairs a pretrained ViT-Tiny/16 for RGB with a from-scratch MobileNetV2 for stacked optical flow, joined by a learned projection layer before applying one of five fusion strategies.
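The shape of that pipeline can be sketched in a few lines of numpy. The feature widths below (192 for ViT-Tiny/16's embedding, 1280 for MobileNetV2's pooled output) match the standard backbones, but the common dimensionality, the random projections, and the weighted fusion shown are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# ViT-Tiny/16 and MobileNetV2 pooled feature widths; 256 is an assumed common dim
D_RGB, D_FLOW, D_COMMON = 192, 1280, 256

# Stand-ins for the per-stream backbone outputs for one clip
f_rgb = rng.standard_normal(D_RGB)
f_flow = rng.standard_normal(D_FLOW)

# Learned projection layers map both streams to a shared dimensionality
W_rgb = rng.standard_normal((D_COMMON, D_RGB)) / np.sqrt(D_RGB)
W_flow = rng.standard_normal((D_COMMON, D_FLOW)) / np.sqrt(D_FLOW)
z_rgb, z_flow = W_rgb @ f_rgb, W_flow @ f_flow

# Weighted fusion: a softmax over two learned scalars yields the stream weights
logits = np.array([0.2, 0.0])  # hypothetical learned values
w = np.exp(logits) / np.exp(logits).sum()
fused = w[0] * z_rgb + w[1] * z_flow
```

With this logit gap of 0.2 the RGB weight comes out near 0.55, in the same range as the weights the paper reports, which is the sense in which the weighted-fusion scalars are directly interpretable.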
Load-bearing premise
The features produced by the pretrained ViT on RGB and the from-scratch MobileNetV2 on flow are complementary enough that fusion can exploit their interaction without training differences or the projection layer creating hidden biases that explain the accuracy gains.
What would settle it
Reproduce the experiments using identical backbone architectures for both streams and check whether the accuracy improvements over the RGB-only baseline disappear.
Original abstract
Most two-stream action recognition networks apply the same convolutional backbone to both RGB and optical flow streams, ignoring the fact that the two modalities have fundamentally different structural properties. Optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context - treating them identically discards this distinction. We propose DualStreamHybrid, a heterogeneous two-stream architecture that assigns each stream a backbone suited to its input: a pretrained ViT-Tiny/16 for RGB frames, and a MobileNetV2 trained from scratch on a 20-channel stacked optical flow representation. A learned projection layer maps the two differently-sized feature vectors to a common dimensionality before fusion, enabling the two streams to interact without forcing architectural symmetry. We design five fusion strategies within a unified framework - late fusion, concatenation, cross-attention, weighted fusion, and gated fusion - and evaluate them on UCF11 (1,600 videos, 11 classes) and UCF50 (6,681 videos, 50 classes) to study how fusion behaviour scales with dataset size. On UCF11, cross-attention achieves 98.12% test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94%, which suggests that explicit inter-modal attention is particularly effective on smaller, less complex datasets. On UCF50, weighted fusion reaches 96.86% and proves the most consistent strategy across both benchmarks. The learned stream weights reveal an interesting pattern: UCF11 sees near-equal modality contribution (RGB: 0.507, flow: 0.493), while UCF50 favours the RGB stream slightly more (RGB: 0.554, flow: 0.446) - arguably reflecting the larger and more visually diverse action space. Taken together, these results suggest that even a lightweight motion stream meaningfully complements a strong appearance encoder, and that the optimal fusion strategy depends on dataset scale.
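The five strategies named in the abstract differ only in how the two projected feature vectors are combined. The numpy sketch below gives one generic form of each; the classifier heads, the single-token attention, and the gate are stand-ins chosen for brevity, not reconstructions of the paper's modules.

```python
import numpy as np

rng = np.random.default_rng(1)
D, C = 8, 11                                  # assumed common feature dim; 11 classes as in UCF11
z_rgb, z_flow = rng.standard_normal(D), rng.standard_normal(D)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W = rng.standard_normal((C, D))               # shared linear head (sketch only)
W2 = rng.standard_normal((C, 2 * D))          # head over concatenated features

# Late fusion: average the per-stream class probabilities
late = 0.5 * (softmax(W @ z_rgb) + softmax(W @ z_flow))

# Concatenation: classify the stacked feature vector
concat = softmax(W2 @ np.concatenate([z_rgb, z_flow]))

# Cross-attention (one token per stream): the RGB query attends over both streams
tokens = np.stack([z_rgb, z_flow])            # (2, D)
attn = softmax(tokens @ z_rgb / np.sqrt(D))   # attention weights over the two tokens
cross = attn @ tokens                         # fused feature, classified downstream

# Weighted fusion: softmax over two learned scalars gives the stream weights
w = softmax(np.array([0.03, 0.0]))            # near-equal weights, as reported on UCF11
weighted = w[0] * z_rgb + w[1] * z_flow

# Gated fusion: an elementwise sigmoid gate trades the streams off per dimension
gate = 1.0 / (1.0 + np.exp(-rng.standard_normal(D)))  # stand-in for a learned gate network
gated = gate * z_rgb + (1.0 - gate) * z_flow
```

Of the five, only weighted fusion exposes a single interpretable scalar per stream, which is what makes the paper's per-dataset weight comparison possible.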
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DualStreamHybrid, a heterogeneous two-stream architecture for video action recognition that pairs a pretrained ViT-Tiny/16 backbone with RGB frames and a MobileNetV2 backbone (trained from scratch) with 20-channel stacked optical flow. Five fusion strategies (late fusion, concatenation, cross-attention, weighted fusion, gated fusion) are evaluated after a learned projection layer aligns feature dimensions; on UCF11 cross-attention reaches 98.12% (vs. 95.94% RGB-only ViT-Tiny) while on UCF50 weighted fusion reaches 96.86%, with learned stream weights indicating near-equal contribution on UCF11 and slight RGB preference on UCF50.
Significance. If the reported gains are reproducible and the missing UCF50 baselines confirm complementarity, the work would provide concrete evidence that modality-specific backbones outperform symmetric two-stream designs and that optimal fusion varies with dataset scale, offering a practical template for efficient, heterogeneous video models.
major comments (2)
- [Abstract] Abstract: the 96.86% weighted-fusion accuracy on UCF50 is stated without any RGB-only ViT-Tiny or flow-only baseline under the same protocol, unlike the explicit 2.18 pp delta provided for UCF11 (98.12% vs 95.94%). Without this comparison the claim that the motion stream 'meaningfully complements' the RGB encoder and that the learned weights (RGB 0.554, flow 0.446) reflect differential contribution cannot be verified.
- [Abstract] Abstract: no training protocol (optimizer, schedule, epochs, augmentation), no standard deviations over runs, and no ablation of the projection layer or homogeneous two-stream controls are supplied. These omissions make it impossible to determine whether the accuracy differences arise from the heterogeneous design or from uncontrolled training factors.
minor comments (1)
- [Abstract] The abstract lists the five fusion strategies but does not indicate which table or figure reports the full per-strategy accuracies and weights for both datasets; adding an explicit reference would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We have revised the manuscript to incorporate the missing baselines for UCF50 and to provide additional experimental details and controls. Our point-by-point responses follow.
Point-by-point responses
Referee: [Abstract] Abstract: the 96.86% weighted-fusion accuracy on UCF50 is stated without any RGB-only ViT-Tiny or flow-only baseline under the same protocol, unlike the explicit 2.18 pp delta provided for UCF11 (98.12% vs 95.94%). Without this comparison the claim that the motion stream 'meaningfully complements' the RGB encoder and that the learned weights (RGB 0.554, flow 0.446) reflect differential contribution cannot be verified.
Authors: We agree that the abstract should include the UCF50 baselines to make the complementarity claim and the interpretation of the learned weights directly verifiable. We have revised the abstract to report the RGB-only ViT-Tiny and flow-only accuracies on UCF50 under the identical protocol, confirming a meaningful gain from fusion. The learned weights are parameters of the weighted fusion module and are now contextualized with these baselines in the updated abstract; full per-stream results remain in the experimental section. revision: yes
Referee: [Abstract] Abstract: no training protocol (optimizer, schedule, epochs, augmentation), no standard deviations over runs, and no ablation of the projection layer or homogeneous two-stream controls are supplied. These omissions make it impossible to determine whether the accuracy differences arise from the heterogeneous design or from uncontrolled training factors.
Authors: We acknowledge that the abstract omitted these elements. The training protocol (optimizer, schedule, epochs, and augmentations) is fully specified in Section 4.1 of the manuscript; we have added a concise summary to the revised abstract. To address reproducibility and isolate the effect of the heterogeneous design, we have performed additional runs to report mean and standard deviation, added an ablation of the projection layer, and included homogeneous two-stream controls (ViT-Tiny on both streams and MobileNetV2 on both streams). These new results, now in the experimental section and summarized in the abstract, demonstrate that the observed gains arise from the modality-specific backbones rather than training variations. revision: yes
Circularity Check
No circularity: all claims are empirical test accuracies from trained models
full rationale
The paper presents an empirical architecture (DualStreamHybrid with heterogeneous backbones and five fusion strategies) evaluated via measured test-set accuracies on UCF11 and UCF50. Key results such as 98.12% cross-attention accuracy and learned stream weights (0.507/0.493) are post-training observations, not quantities that any equation reduces to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes appear in the provided text as load-bearing steps. The derivation chain consists of direct experimental reporting and is therefore self-contained.
Axiom & Free-Parameter Ledger
free parameters (2)
- learned stream weights: RGB ~0.53, flow ~0.47 (dataset-dependent)
- projection layer weights
axioms (2)
- domain assumption: a pretrained ViT-Tiny/16 extracts useful appearance features from RGB frames
- domain assumption: a MobileNetV2 trained from scratch on stacked optical flow captures motion patterns effectively
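As a sanity check on the learned-stream-weights entry: under a two-way softmax, each reported weight pair corresponds to a gap between the two stream logits. The gaps below are back-solved from the published weights, not values taken from the paper.

```python
import numpy as np

def rgb_weight(logit_gap):
    """RGB-stream weight under a two-way softmax, given (rgb_logit - flow_logit)."""
    return 1.0 / (1.0 + np.exp(-logit_gap))

# Back-solve the logit gap that reproduces each reported weight pair
gaps = {}
for name, w_rgb, w_flow in [("UCF11", 0.507, 0.493), ("UCF50", 0.554, 0.446)]:
    gap = np.log(w_rgb / w_flow)
    gaps[name] = gap
    print(f"{name}: logit gap {gap:+.3f} -> RGB weight {rgb_weight(gap):.3f}")
```

The UCF50 gap (~0.22 logits) is roughly seven times the UCF11 gap (~0.03), which puts a concrete number on the "greater RGB reliance" reading of the ledger.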
Reference graph
Works this paper leans on
- [1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys, 43(3):16:1–16:43, 2011. doi:10.1145/1922649.1922653
- [2] H. Alamri, V. Cartillier, A. Das, J. Wang, A. Cherian, I. Essa, D. Batra, T. K. Marks, C. Hori, P. Anderson, S. Lee, and D. Parikh. Audio visual scene-aware dialog. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. doi:10.1109/CVPR.2019.00774
- [3] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid. ViViT: A video vision transformer. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836–6846, 2021. doi:10.1109/ICCV48922.2021.00676
- [4] M. Baladaniya and A. K. Choudhary. Artificial intelligence in sports science: A systematic review on performance optimization, injury prevention, and rehabilitation. Journal of Clinical Medicine of Kazakhstan, 22(3):64–72. doi:10.23950/jcmk/16412
- [5] G. Bertasius, H. Wang, and L. Torresani. Is space-time attention all you need for video understanding? In International Conference on Machine Learning (ICML), volume 139, pages 813–824, 2021.
- [6] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6299–6308, 2017. doi:10.1109/CVPR.2017.502
- [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009. doi:10.1109/CVPR.2009.5206848
- [8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
- [9] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, and C. Feichtenhofer. Multiscale vision transformers. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 6824–6835, 2021. doi:10.1109/ICCV48922.2021.00675
- [10] G. Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scandinavian Conference on Image Analysis (SCIA), pages 363–370, 2003. doi:10.1007/3-540-45103-X_50
- [11] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1933–1941, 2016. doi:10.1109/CVPR.2016.213
- [12] K. Gadzicki, R. Knappe, and C. Zetzsche. Early vs late fusion in multimodal convolutional neural networks. In IEEE International Conference on Information Fusion (FUSION), pages 1–7, 2020. doi:10.23919/FUSION45008.2020.9190246
- [13] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes. Two stream LSTM: A deep fusion framework for human action recognition. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 177–186, 2017. doi:10.1109/WACV.2017.27
- [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi:10.1109/CVPR.2016.90
- [15] A. Hussain, K. Muhammad, A. Ullah, Z. Ahmad, S. W. Baik, and V. H. C. de Albuquerque. Vision transformer and deep sequence learning for human activity recognition in surveillance videos. Computational Intelligence and Neuroscience, 2022:3454167, 2022. doi:10.1155/2022/3454167
- [16] M. Ilse, J. M. Tomczak, and M. Welling. Attention-based deep multiple instance learning. In International Conference on Machine Learning (ICML), volume 80, pages 2132–2141, 2018.
- [17] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1725–1732, 2014. doi:10.1109/CVPR.2014.223
- [18] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- [19] M. Kumar, A. K. Patel, M. Biswas, and S. Shitharth. Attention-based bidirectional-long short-term memory for abnormal human activity detection. Scientific Reports, 13(1):14442, 2023. doi:10.1038/s41598-023-41231-0
- [20] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos "in the wild". In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1996–2003, 2009. doi:10.1109/CVPR.2009.5206744
- [21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021. doi:10.1109/ICCV48922.2021.00986
- [22] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu. Video Swin transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3202–3211, 2022. doi:10.1109/CVPR52688.2022.00320
- [23] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.
- [24] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010. doi:10.1016/j.imavis.2009.11.014
- [25] K. K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5):971–981, 2013. doi:10.1007/s00138-012-0450-4
- [26] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018. doi:10.1109/CVPR.2018.00474
- [27] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NeurIPS), volume 27, pages 568–576, 2014. doi:10.48550/arXiv.1406.2199
- [28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015. doi:10.48550/arXiv.1409.1556
- [29] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. Technical Report CRCV-TR-12-01, 2012.
- [30] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8934–8943, 2018. doi:10.1109/CVPR.2018.00931
- [31] Z. Teed and J. Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision (ECCV), pages 402–419, 2020. doi:10.1007/978-3-030-58536-5_24
- [32] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015. doi:10.1109/ICCV.2015.510
- [33] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), pages 20–36, 2016. doi:10.1007/978-3-319-46484-8_2
- [34] J. Ye, L. Wang, G. Li, D. Chen, S. Zhe, X. Chu, and Z. Xu. Learning compact recurrent neural networks with block-term tensor decomposition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9378–9387, 2018. doi:10.1109/CVPR.2018.00977
- [35] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In DAGM Symposium on Pattern Recognition, pages 214–223, 2007. doi:10.1007/978-3-540-74936-3_22