pith. machine review for the scientific record.

arxiv: 2605.12506 · v1 · submitted 2026-03-16 · 💻 cs.CV · cs.AI · cs.HC · cs.RO · eess.IV

Recognition: 1 theorem link

· Lean Theorem

Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 10:40 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.HC · cs.RO · eess.IV

keywords gesture detection · on-device inference · tiny-YOLO · runtime adaptation · energy efficiency · ACE profiles · driver gestures · ROI gating

The pith

A runtime controller switches among tiny-YOLO variants to cut on-device gesture-detection energy by 4x while holding event-level F1 at 0.8-0.9.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Scale-Gest builds a dense family of tiny-YOLO detectors that differ in resolution and stride, each mapped to a device-specific Accuracy-Complexity-Energy profile. A lightweight controller chooses the active profile on the fly according to battery level and latency targets, while a motion-aware ROI gate crops frames to only the moving hand region. On a battery-powered laptop processing continuous gesture streams, this yields 4x lower per-frame energy than any single fixed detector, with a mean latency of 6 ms and event-level F1 between 0.8 and 0.9. The authors also release the temporally annotated DSG-18 dataset of driver gestures to support realistic evaluation. If the approach holds, continuous gesture interfaces become feasible on power-constrained devices without sacrificing responsiveness.

Core claim

Scale-Gest expands the detector space into many tiny-YOLO operating points, each assigned a device-calibrated ACE profile. The runtime controller selects the profile that satisfies user and battery constraints; a motion-aware hand-tracking gate then crops the input. On a battery-powered laptop the system reduces per-frame energy from 6.9 mJ to 1.6 mJ, keeps event-level F1 at 0.8-0.9, and reports 6 ms mean latency.

What carries the argument

The ACE profiles: device-calibrated mappings from model resolution and stride to measured accuracy, complexity, and energy points, with the controller selecting among them at runtime.
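The selection step the paper describes can be sketched as a small lookup over calibrated operating points. This is an illustrative reconstruction, not the authors' code: the profile names and numbers below are invented placeholders, and the selection policy (lowest energy under low battery, otherwise highest F1 within the latency budget) is one plausible reading of "selects the profile that satisfies user and battery constraints".

```python
from dataclasses import dataclass

# Hypothetical sketch of ACE-style runtime selection; all profile
# numbers are illustrative, not measurements from the paper.

@dataclass(frozen=True)
class ACEProfile:
    name: str          # model variant (a resolution/stride operating point)
    f1: float          # measured event-level accuracy
    latency_ms: float  # measured mean latency (complexity proxy)
    energy_mj: float   # measured per-frame energy

PROFILES = [
    ACEProfile("yolo-160s2", f1=0.80, latency_ms=4.0, energy_mj=1.6),
    ACEProfile("yolo-224s2", f1=0.85, latency_ms=6.0, energy_mj=3.1),
    ACEProfile("yolo-320s1", f1=0.90, latency_ms=11.0, energy_mj=6.9),
]

def select_profile(battery_pct, latency_budget_ms, min_f1=0.8):
    """Pick the most accurate profile within the latency budget,
    falling back to the lowest-energy one when battery is low."""
    feasible = [p for p in PROFILES
                if p.latency_ms <= latency_budget_ms and p.f1 >= min_f1]
    if not feasible:
        # Degrade gracefully: cheapest profile rather than no detection.
        return min(PROFILES, key=lambda p: p.energy_mj)
    if battery_pct < 20:
        return min(feasible, key=lambda p: p.energy_mj)
    return max(feasible, key=lambda p: p.f1)
```

With these placeholder profiles, a full battery and a 12 ms budget select the heaviest model, while a near-empty battery drops to the cheapest feasible one.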

If this is right

  • A single fixed detector is no longer required; multiple calibrated points allow continuous operation under varying battery levels.
  • The ROI gate reduces input complexity without retraining, so latency stays low even when the controller picks a heavier model.
  • Event-level F1 remains 0.8-0.9 across the tested profiles, showing accuracy does not have to be traded for energy.
  • The DSG-18 dataset enables reproducible testing of driver-gesture detectors in realistic car scenarios.
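The crop-before-detect idea behind the ROI gate can be shown with plain frame differencing. The paper's gate is hand-tracking-specific; this sketch only demonstrates the general mechanism (find the changed region, pad it, and pass only that crop to the detector), with thresholds chosen arbitrarily.

```python
import numpy as np

# Illustrative motion-aware ROI gate via frame differencing. The paper's
# gate tracks the hand specifically; this only shows crop-before-detect.

def motion_roi(prev, curr, thresh=25, pad=8):
    """Return (y0, y1, x0, x1) bounding the moving region, or None."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > thresh
    ys, xs = np.nonzero(diff)
    if ys.size == 0:
        return None  # no motion: skip detection entirely this frame
    h, w = curr.shape[:2]
    return (max(ys.min() - pad, 0), min(ys.max() + pad + 1, h),
            max(xs.min() - pad, 0), min(xs.max() + pad + 1, w))

prev = np.zeros((120, 160), dtype=np.uint8)
curr = prev.copy()
curr[40:60, 70:90] = 200          # simulated moving hand patch
roi = motion_roi(prev, curr)
y0, y1, x0, x1 = roi
crop = curr[y0:y1, x0:x1]         # only this crop goes to the detector
```

Because the crop is much smaller than the full frame, a heavier model applied to it can still meet the latency budget, which is the effect the bullet above describes.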

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar ACE-style calibration could be applied to other on-device vision tasks such as object tracking or face detection.
  • On phones the same controller might enable always-on gesture interfaces that respect strict thermal and battery limits.
  • Testing the profiles on embedded boards like Jetson Nano or Raspberry Pi would reveal how well the laptop-derived numbers generalize.

Load-bearing premise

The device-calibrated ACE profiles and motion-aware ROI gate will transfer to new hardware platforms and unseen lighting or pose conditions without re-calibration or loss of the reported energy savings.

What would settle it

Measure energy, F1, and latency on a smartphone or different laptop under changed lighting and poses; if the 4x energy reduction disappears without re-calibration, the claim fails.
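As a back-of-envelope check of what the reported figures imply, the per-frame numbers can be scaled to hourly energy. The 30 fps rate is an assumption for illustration; the abstract does not state the frame rate.

```python
# Back-of-envelope arithmetic on the abstract's energy figures.
# 30 fps is an assumed frame rate, not stated in the paper.
fixed_mj, ace_mj = 6.9, 1.6                 # per-frame energy (abstract)
ratio = fixed_mj / ace_mj                   # ~4.3x, reported as "4x"
fps = 30
j_per_hour_fixed = fixed_mj * 1e-3 * fps * 3600   # fixed detector, J/h
j_per_hour_ace = ace_mj * 1e-3 * fps * 3600       # ACE controller, J/h
```

At that assumed rate the gap is roughly 745 J/h versus 173 J/h, which is why the cross-device replication proposed above matters: if re-calibration is required on each platform, the practical saving shrinks.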

Figures

Figures reproduced from arXiv: 2605.12506 by Abdul Basit, Muhammad Shafique, Saim Rehman.

Figure 1. Temporal and spatial sparsity of driver gestures. (A) A typical DSG-18 gesture episode lasting only 10 frames, surrounded by long background periods. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]

Figure 2. Device-calibrated Accuracy–Energy–Complexity (ACE) trade-offs. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]

Figure 5. Accuracy–efficiency comparison of YOLOv8 and v10 synthesized families. [PITH_FULL_IMAGE:figures/full_fig_p003_5.png]

Figure 7. Dataset class distribution: comparison of sample counts. [PITH_FULL_IMAGE:figures/full_fig_p003_7.png]

Figure 6. DSG-18 dataset generation & post-processing pipeline. To capture temporal behavior under deployment-like conditions, the authors curate DSG-18, a video suite aligned one-to-one with the HaGRID gesture classes; each clip is recorded under a controlled but varied protocol. [PITH_FULL_IMAGE:figures/full_fig_p003_6.png]

Figure 8. Concatenated gesture timeline for DSG-18. [PITH_FULL_IMAGE:figures/full_fig_p004_8.png]

Figure 10. The ACE controller maintains high F1 and low latency while reducing energy per frame and keeping thermals and battery within safe limits. When latency rises, the selector increases the complexity weight γ_C to favor lower-complexity models and restore the target frame rate. (Panels: battery level over time in percent; real-time runs on battery SoC; model switch retains F1/latency.) [PITH_FULL_IMAGE:figures/full_f…]

Figure 9. ACE surfaces for micro-scale YOLOv8–12, showing the trade-off be… [PITH_FULL_IMAGE:figures/full_fig_p006_9.png]
read the original abstract

Realizing on-device ML-based gesture detection under tight real-time performance, energy and memory constraints is challenging, especially when considering mobile devices with varying battery-power levels. Existing EdgeAI deployments typically rely on a single fixed detector, limiting optimization opportunities. We present Scale-Gest, a novel run-time adaptive gesture detection framework that expands the detector space into a dense family of tiny-YOLO architectures. We introduce multiple novel device-calibrated ACE (Accuracy-Complexity-Energy) profiles by analyzing different model-resolution-stride operating points. A lightweight run-time controller selects an appropriate ACE mode under user-defined and battery constraints, while a motion-aware hand-gesture-tracking ROI gate crops the input for reduced complexity detection. To evaluate performance of our system in real-world car driving scenarios, we introduce a temporally-annotated Driver Simulated Gesture (DSG-18) dataset. Scale-Gest maintains event-level F1 while significantly reducing energy and latency compared to single-detector approaches. On a battery-powered laptop running gesture streams, our ACE controller reduces per-frame energy by 4x (from 6.9 mJ to 1.6 mJ) while maintaining high gesture-detection performance (event-level F1 = 0.8-0.9) and low mean latency (6 ms).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Scale-Gest, a runtime-adaptive on-device gesture detection framework that synthesizes a dense family of tiny-YOLO models, introduces device-calibrated ACE (Accuracy-Complexity-Energy) profiles for selecting operating points under battery constraints, and employs a motion-aware hand-gesture-tracking ROI gate to reduce input complexity. It also contributes the temporally-annotated DSG-18 dataset for car-driving scenarios and reports that the ACE controller achieves a 4x per-frame energy reduction (6.9 mJ to 1.6 mJ) at event-level F1 of 0.8-0.9 and mean latency of 6 ms on a battery-powered laptop.

Significance. If the energy-accuracy trade-offs and runtime selection mechanism prove robust, the work could meaningfully advance practical deployment of gesture detection on resource-constrained mobile devices by replacing fixed single-detector baselines with a scalable model space and lightweight controller; the introduction of the DSG-18 dataset and explicit focus on battery-aware operation are additional strengths.

major comments (2)
  1. [Abstract & Evaluation] The headline 4x energy reduction (6.9 mJ to 1.6 mJ per frame) and the associated ACE profiles are demonstrated exclusively on a single battery-powered laptop; no measurements on other mobile SoCs, accelerators, DVFS regimes, or sensor pipelines are reported, which directly undermines the paper's positioning of Scale-Gest as a general solution for varying on-device platforms.
  2. [Evaluation] The manuscript provides no description of how the ACE profile thresholds were derived, no error bars or statistical validation on the energy and latency figures, and no ablation isolating the contribution of the ROI gate from that of model switching, leaving the quantitative claims difficult to reproduce or generalize.
minor comments (1)
  1. The abstract and introduction refer to 'tiny-YOLO architectures' and 'DSG-18 dataset' without specifying the exact model variants, input resolutions, or dataset statistics (number of sequences, gesture classes, lighting/pose variations), which would aid clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, clarifying the scope of our evaluation while committing to revisions that improve reproducibility and transparency without overstating the current results.

read point-by-point responses
  1. Referee: [Abstract & Evaluation] The headline 4x energy reduction (6.9 mJ to 1.6 mJ per frame) and the associated ACE profiles are demonstrated exclusively on a single battery-powered laptop; no measurements on other mobile SoCs, accelerators, DVFS regimes, or sensor pipelines are reported, which directly undermines the paper's positioning of Scale-Gest as a general solution for varying on-device platforms.

    Authors: We acknowledge that all reported energy, latency, and ACE profile results were obtained on a single battery-powered laptop chosen as a representative mobile platform with variable power constraints. The framework is designed so that ACE profiles are device-calibrated and can be regenerated for other SoCs using the same profiling pipeline; however, we performed no experiments on additional hardware, DVFS settings, or sensor pipelines. In the revised manuscript we will (1) explicitly qualify the evaluation scope in the abstract and evaluation section, (2) add a limitations paragraph describing how the calibration procedure generalizes, and (3) include guidance for practitioners to derive ACE profiles on their target platforms. We cannot supply new cross-platform measurements at this stage. revision: partial

  2. Referee: [Evaluation] The manuscript provides no description of how the ACE profile thresholds were derived, no error bars or statistical validation on the energy and latency figures, and no ablation isolating the contribution of the ROI gate from that of model switching, leaving the quantitative claims difficult to reproduce or generalize.

    Authors: We agree that these details are necessary for reproducibility. In the revised manuscript we will add: (1) a step-by-step description of how the ACE thresholds were obtained from the profiling data, including the exact criteria used to select operating points; (2) error bars and standard deviations for all energy and latency figures, computed from repeated measurement runs; and (3) an ablation study that quantifies the separate contributions of the ROI gate and the model-switching controller using the existing experimental traces. These additions will be placed in the Evaluation section. revision: yes
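The promised ablation can be framed as a 2x2 factorial over (ROI gate on/off) x (model switching on/off). The sketch below shows the bookkeeping only; the mJ values are invented placeholders, since the paper reports only the two endpoint measurements.

```python
# Hypothetical 2x2 ablation bookkeeping. Only the 6.9 and 1.6 mJ
# endpoints come from the paper; the middle cells are invented.
energy = {
    ("no_gate", "fixed"):  6.9,   # baseline single detector
    ("gate",    "fixed"):  4.0,   # placeholder
    ("no_gate", "switch"): 3.2,   # placeholder
    ("gate",    "switch"): 1.6,   # full Scale-Gest system
}
gate_effect = energy[("no_gate", "fixed")] - energy[("gate", "fixed")]
switch_effect = energy[("no_gate", "fixed")] - energy[("no_gate", "switch")]
# Whatever savings the full system shows beyond the two main effects
# is the gate-switching interaction.
interaction = (energy[("gate", "switch")]
               - energy[("no_gate", "fixed")]
               + gate_effect + switch_effect)
```

Filling the two missing cells from the existing traces, as the rebuttal proposes, would let readers attribute the 4x reduction between the gate, the controller, and their interaction.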

standing simulated objections not resolved
  • New experimental measurements on additional mobile SoCs, accelerators, DVFS regimes, or sensor pipelines were not collected and cannot be provided without further hardware experiments.

Circularity Check

0 steps flagged

No significant circularity detected in Scale-Gest derivation chain

full rationale

The paper constructs a family of tiny-YOLO variants at different resolutions and strides, calibrates ACE operating points by direct measurement on the target laptop hardware, and evaluates the runtime controller plus ROI gate on the newly introduced DSG-18 dataset. All reported numbers (4x energy reduction from 6.9 mJ to 1.6 mJ, F1 0.8-0.9, 6 ms latency) are obtained by explicit execution and measurement rather than by any equation that re-derives a fitted quantity from itself. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the central claims; the framework is presented as an engineering synthesis whose performance is validated externally on the provided dataset and hardware.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The framework relies on standard assumptions about YOLO performance scaling with resolution and stride, plus ad-hoc device-specific calibration. The abstract lists no explicit free parameters, but the selection thresholds and profiles are likely fitted to the laptop hardware.

free parameters (1)
  • ACE profile thresholds
    The boundaries for switching between accuracy-complexity-energy modes are calibrated on the specific device and not derived from first principles.
axioms (1)
  • domain assumption Tiny-YOLO variants maintain usable accuracy at reduced resolutions and strides
    Assumed without proof in the abstract when expanding the model space.
invented entities (2)
  • ACE profiles no independent evidence
    purpose: To quantify and select operating points for accuracy, complexity, and energy
    New concept introduced for runtime selection
  • DSG-18 dataset no independent evidence
    purpose: Temporally-annotated driving gesture data for evaluation
    New dataset created for the paper
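One principled way the otherwise ad-hoc thresholds could be derived is to keep only Pareto-optimal (energy, F1) operating points from the profiling run and switch among those. This is a sketch of that idea, not the authors' procedure, and the measurements are invented.

```python
# Sketch: prune profiled operating points to the energy-accuracy Pareto
# front before installing them as switchable ACE modes. Numbers invented.
points = [("a", 1.6, 0.80), ("b", 2.5, 0.78),   # "b" dominated by "a"
          ("c", 3.1, 0.85), ("d", 6.9, 0.90)]    # (name, mJ, F1)

def pareto_front(pts):
    """Keep points no other point beats on both energy and F1."""
    return [p for p in pts
            if not any(q[1] <= p[1] and q[2] >= p[2] and q != p
                       for q in pts)]

front = pareto_front(points)  # "b" dropped: "a" is cheaper AND more accurate
```

Publishing such a derivation would move the switching thresholds from "fitted free parameter" to "determined by the profiling data", answering part of the referee's reproducibility objection.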

pith-pipeline@v0.9.0 · 5547 in / 1574 out tokens · 42148 ms · 2026-05-15T10:40:08.224699+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
