pith. machine review for the scientific record.

arxiv: 2604.21011 · v1 · submitted 2026-04-22 · 💻 cs.CV · q-bio.NC

Recognition: unknown

Micro-DualNet: Dual-Path Spatio-Temporal Network for Micro-Action Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:06 UTC · model grok-4.3

classification 💻 cs.CV q-bio.NC
keywords micro-action recognition · dual-path network · spatio-temporal modeling · adaptive routing · fine-grained video understanding · body-part entities · action consistency loss

The pith

A dual-path network with parallel spatial-then-temporal and temporal-then-spatial routes plus adaptive per-body-part routing improves recognition of subtle micro-actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Micro-actions are brief localized movements like head scratching whose defining features can be either spatial arrangements or timing patterns. Existing video models lock into one fixed order of processing space and time, which leaves some actions poorly modeled. The paper introduces two parallel pathways that invert the order of spatial and temporal modeling, then lets each anatomically defined body part learn which order suits it best through adaptive routing. A consistency loss keeps the two pathways aligned on the same action label. If this architecture matches the inherent variety of micro-actions, it should raise accuracy on datasets built specifically for these fine movements.
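The adaptive routing described above can be sketched as a per-entity softmax gate over the two pathway outputs. This is a minimal illustration, not the authors' implementation: the gate shape (two logits per body part) and the convex mixing rule are assumptions, since the paper's exact parameterization is not given in the text here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route_entities(st_feats, ts_feats, route_logits):
    """Mix ST- and TS-pathway features per body-part entity.

    st_feats, ts_feats: (E, D) arrays, one row per entity.
    route_logits:       (E, 2) learned preferences, one pair per entity.
    """
    gates = softmax(route_logits, axis=-1)                # rows sum to 1
    mixed = gates[:, 0:1] * st_feats + gates[:, 1:2] * ts_feats
    return mixed, gates

# Toy example: 5 entities (e.g. head, torso, arms, hands, legs), 8-dim features.
rng = np.random.default_rng(0)
st = rng.normal(size=(5, 8))
ts = rng.normal(size=(5, 8))
logits = np.zeros((5, 2))        # uniform preference before any training
mixed, gates = route_entities(st, ts, logits)
```

With zero logits each entity mixes the two pathways equally; training would push each row of `logits` toward the order that suits that body part.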

Core claim

Micro-actions display diverse spatio-temporal characteristics that no single decomposition order can capture. A dual-path network running an ST pathway and a TS pathway in parallel, combined with entity-level adaptive routing that lets each body part select its preferred order and a Mutual Action Consistency loss that enforces cross-path agreement, accommodates this diversity and delivers competitive results on MA-52 together with state-of-the-art performance on iMiGUE.

What carries the argument

Dual parallel ST and TS pathways that process anatomically-grounded spatial entities, with entity-level adaptive routing deciding the processing order per body part and Mutual Action Consistency loss maintaining coherence.
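The consistency mechanism can be illustrated with a symmetric divergence between the two pathways' class distributions. The paper's exact MAC formulation is not spelled out in the text above, so the symmetric KL below is an assumption for illustration, not the authors' definition.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mac_loss(logits_st, logits_ts, eps=1e-9):
    """Symmetric KL between ST and TS class distributions:
    zero when the pathways agree on the action, positive otherwise."""
    p, q = softmax(logits_st), softmax(logits_ts)
    kl_pq = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    kl_qp = np.sum(q * (np.log(q + eps) - np.log(p + eps)), axis=-1)
    return float(0.5 * (kl_pq + kl_qp).mean())

# Agreeing pathways incur no penalty; disagreeing ones do.
logits_a = np.array([[2.0, -1.0, 0.5]])
logits_b = np.array([[0.1, 1.5, -0.3]])
```

Any symmetric agreement penalty (e.g. Jensen-Shannon divergence) would serve the same role of keeping the two paths aligned on one label.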

If this is right

  • Each body part can be assigned an optimal processing order rather than a uniform fusion strategy.
  • Cross-path consistency enforcement improves recognition when spatial and temporal cues conflict.
  • Performance gains appear on datasets that emphasize subtle localized movements.
  • The architecture scales to multiple entity types without requiring hand-designed fusion weights.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-order idea could be tested on other short-duration fine-grained video tasks such as micro-gestures or facial micro-expressions.
  • If routing preferences prove stable across datasets, the method might reduce the need for dataset-specific hyperparameter tuning of temporal modeling.
  • Extending the routing to operate at the pixel level rather than body-part level could reveal whether the benefit is truly entity-driven.

Load-bearing premise

Micro-actions inherently require more than one fixed spatio-temporal decomposition order, and entity-level routing can stably learn the right preference for each body part.

What would settle it

A single-path network matching or exceeding the dual-path accuracy on the iMiGUE dataset would show that the diversity claim and the need for dual routes do not hold.

Figures

Figures reproduced from arXiv: 2604.21011 by Birkan Tunç, Casey J. Zampella, Evangelos Sariyanidi, Gokul Nair, Lisa Yankowitz, Naga VS Raviteja Chappa, Robert T. Schultz.

Figure 1: Micro-action recognition requires flexible spatio-temporal process…
Figure 2: Overview of the proposed Micro-DualNet framework. Given input video frames with corresponding body joints, our framework extracts CNN features (fCNN) and decomposes them into anatomically-grounded entity features (X) via the Spatial Entity Module (SEM) [section III-B]. These entity features are processed through parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways [section III-C]: the ST path …
Figure 3: Violin plots of percent time engaged in “retracting feet” and …
Figure 4: t-SNE [38] visualizations of learned representations on MA-52 [14] (top) and iMiGUE [26] (bottom) datasets. Baseline (left) shows heavily overlapping classes, single ST/TS paths (middle) yield partial clustering improvements, while Micro-DualNet (right) shows improved, though not complete, class grouping, with several action categories forming more coherent clusters. The remaining overlap reflects the inhe…
Figure 5: (Left) Class-wise accuracy heatmap for Dual, ST, and TS paths across MA-52 [14] dataset. (Right) Model performance by class difficulty: dual-path …
read the original abstract

Micro-actions are subtle, localized movements lasting 1-3 seconds such as scratching one's head or tapping fingers. Such subtle actions are essential for social communication, ubiquitously used in natural interactions, and thus critical for fine-grained video understanding, yet remain poorly understood by current computer vision systems. We identify a fundamental challenge: micro-actions exhibit diverse spatio-temporal characteristics where some are defined by spatial configurations while others manifest through temporal dynamics. Existing methods that commit to a single spatio-temporal decomposition cannot accommodate this diversity. We propose a dual-path network that processes anatomically-grounded spatial entities through parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways. The ST path captures spatial configurations before modeling temporal dynamics, while the TS path inverts this order to prioritize temporal dynamics. Rather than fixed fusion, we introduce entity-level adaptive routing where each body part learns its optimal processing preference, complemented by Mutual Action Consistency (MAC) loss that enforces cross-path coherence. Extensive experiments demonstrate competitive performance on MA-52 dataset and state-of-the-art results on iMiGUE dataset. Our work reveals that architectural adaptation to the inherent complexity of micro-actions is essential for advancing fine-grained video understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Micro-DualNet, a dual-path spatio-temporal network for micro-action recognition. It processes anatomically-grounded spatial entities via parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways, introduces entity-level adaptive routing to learn per-body-part processing preferences, and employs a Mutual Action Consistency (MAC) loss to enforce cross-path coherence. The work claims that this architecture accommodates the diverse spatio-temporal characteristics of micro-actions (lasting 1-3 seconds) better than single-decomposition methods, reporting competitive performance on the MA-52 dataset and state-of-the-art results on the iMiGUE dataset.

Significance. If the performance claims and necessity of the dual adaptive design are substantiated, the paper would offer a targeted architectural solution to a recognized gap in fine-grained video understanding, particularly for subtle actions relevant to social interaction analysis. The entity-level routing and MAC loss represent a concrete mechanism for handling action diversity without fixed fusion, which could influence subsequent work on adaptive spatio-temporal models if supported by rigorous controls.

major comments (2)
  1. [Abstract and Introduction] The abstract and introduction assert that micro-actions exhibit diverse spatio-temporal characteristics unaccommodated by any single decomposition and that entity-level adaptive routing learns stable preferences, yet no ablation results are referenced comparing the full dual-path model against single ST-path, single TS-path, or fixed equal-weight fusion baselines. This comparison is load-bearing for the central claim that the dual construction is required.
  2. [Abstract] The experimental claims of competitive results on MA-52 and SOTA on iMiGUE are stated without any reported quantitative metrics (accuracy, F1, etc.), dataset statistics, baseline comparisons, ablation tables, or error bars in the provided text, rendering the performance assertions unevaluable and the soundness of the empirical validation low.
minor comments (1)
  1. [Abstract] The definition of micro-actions as 'lasting 1-3 seconds' would benefit from a citation to prior literature establishing this temporal scale to ground the problem statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where the manuscript can more clearly substantiate its central claims. We address each major comment below and will revise the manuscript accordingly to improve clarity and empirical grounding.

read point-by-point responses
  1. Referee: [Abstract and Introduction] The abstract and introduction assert that micro-actions exhibit diverse spatio-temporal characteristics unaccommodated by any single decomposition and that entity-level adaptive routing learns stable preferences, yet no ablation results are referenced comparing the full dual-path model against single ST-path, single TS-path, or fixed equal-weight fusion baselines. This comparison is load-bearing for the central claim that the dual construction is required.

    Authors: We agree that explicit cross-references to these comparisons are needed in the abstract and introduction to support the core architectural claim. The full manuscript includes ablation studies in the experiments section that directly compare the complete dual-path model (with entity-level adaptive routing and MAC loss) against single ST-path, single TS-path, and fixed equal-weight fusion baselines, demonstrating consistent gains from the adaptive dual design. We will revise the introduction to reference these results explicitly (e.g., citing the relevant tables and summarizing the performance deltas) and add a concise mention of the key ablation outcomes to the abstract. revision: yes

  2. Referee: [Abstract] The experimental claims of competitive results on MA-52 and SOTA on iMiGUE are stated without any reported quantitative metrics (accuracy, F1, etc.), dataset statistics, baseline comparisons, ablation tables, or error bars in the provided text, rendering the performance assertions unevaluable and the soundness of the empirical validation low.

    Authors: We concur that the abstract would be strengthened by including specific quantitative metrics. The manuscript reports detailed results in the experiments section, including accuracy and F1 scores on MA-52 and iMiGUE, dataset statistics, comparisons to multiple baselines, ablation tables, and error bars (standard deviations across runs). We will revise the abstract to incorporate the key numerical results and notable improvements while maintaining conciseness. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical architecture proposal without derivations

full rationale

The manuscript is an empirical proposal of a dual-path network (ST/TS pathways, entity-level adaptive routing, MAC loss) for micro-action recognition. No equations, derivations, or mathematical steps appear in the abstract or described content. Central claims rest on reported performance on MA-52 and iMiGUE datasets rather than any self-referential fitting, self-definition, or load-bearing self-citation chains. No reduction of a 'prediction' or 'result' to its own inputs by construction is present, satisfying the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that micro-actions require dual processing orders and on multiple learned components whose values are fitted during training.

free parameters (2)
  • per-entity routing weights
    Learned preferences for ST versus TS path per body part; number of parameters and initialization unspecified.
  • MAC loss weighting factor
    Balance between consistency loss and classification loss; value chosen to optimize reported performance.
axioms (1)
  • domain assumption Micro-actions exhibit diverse spatio-temporal characteristics where some are defined by spatial configurations while others manifest through temporal dynamics.
    Invoked in the abstract as the fundamental challenge that single-decomposition methods cannot address.
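The ledger's two free parameters meet in a single training objective. The additive form and the default λ below are illustrative assumptions; as the ledger notes, the actual weighting factor is fitted, not fixed.

```python
def combined_objective(cls_loss, mac_value, lam=0.5):
    """Classification loss plus lambda-weighted MAC consistency term.
    `lam` is the ledger's 'MAC loss weighting factor'; its real value
    is a fitted hyperparameter, and 0.5 here is purely illustrative."""
    return cls_loss + lam * mac_value
```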

pith-pipeline@v0.9.0 · 5543 in / 1257 out tokens · 38757 ms · 2026-05-10T00:06:11.404943+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1] G. Bertasius, H. Wang, and L. Torresani. Is space-time attention all you need for video understanding? In ICML, volume 2, page 4, 2021.
  2. [2] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):172–186, 2019.
  3. [3] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  4. [4] H. Chen, H. Shi, X. Liu, X. Li, and G. Zhao. SMG: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis. International Journal of Computer Vision, 131(6):1346–1366, 2023.
  5. [5] Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13359–13368, 2021.
  6. [6] K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, and H. Lu. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 183–192, 2020.
  7. [7] H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2969–2978, 2022.
  8. [8] C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.
  9. [9] S. Goldman and P. E. Greene. Stereotypies in autism: a video demonstration of their clinical variability. Frontiers in Integrative Neuroscience, 6:121, 2013.
  10. [10] F. Gong, J. Chen, J. Zhu, Q. Bao, F. Gao, R. Gu, and G. Xu. Micro-action recognition via hierarchical fusion and inference. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11327–11332, 2024.
  11. [11] R. Grzadzinski, T. Carr, C. Colombi, K. McGuire, S. Dufek, A. Pickles, and C. Lord. Measuring changes in social communication behaviors: Preliminary development of the Brief Observation of Social Communication Change (BOSCC). Journal of Autism and Developmental Disorders, 46(7):2464–2479, 2016.
  12. [12] J. Gu, K. Li, F. Wang, Y. Wei, Z. Wu, H. Fan, and M. Wang. Motion matters: Motion-guided modulation network for skeleton-based micro-action recognition. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025.
  13. [13] J. Gu, F. Wang, K. Li, Y. Wei, Z. Wu, and D. Guo. MM-Gesture: Towards precise micro-gesture recognition through multimodal fusion. arXiv preprint arXiv:2507.08344, 2025.
  14. [14] D. Guo, K. Li, B. Hu, Y. Zhang, and M. Wang. Benchmarking micro-action recognition: Dataset, methods, and applications. IEEE Transactions on Circuits and Systems for Video Technology, 34(7):6238–6252, 2024.
  15. [15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
  16. [16] D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
  17. [17] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  18. [18] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
  19. [19] S. R. Leekam, M. R. Prior, and M. Uljarevic. Restricted and repetitive behaviors in autism spectrum disorders: a review of research in the last decade. Psychological Bulletin, 137(4):562, 2011.
  20. [20] K. Li, D. Guo, G. Chen, C. Fan, J. Xu, Z. Wu, H. Fan, and M. Wang. Prototypical calibrating ambiguous samples for micro-action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4815–4823, 2025.
  21. [21] K. Li, D. Guo, G. Chen, X. Peng, and M. Wang. Joint skeletal and semantic embedding loss for micro-gesture classification. arXiv preprint arXiv:2307.10624, 2023.
  22. [22] K. Li, Y. Wang, G. Peng, G. Song, Y. Liu, H. Li, and Y. Qiao. UniFormer: Unified transformer for efficient spatial-temporal representation learning. In International Conference on Learning Representations, 2022.
  23. [23] Q. Li, X. Huang, H. Chen, F. He, Q. Chen, and Z. Wang. Advancing micro-action recognition with multi-auxiliary heads and hybrid loss optimization. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11313–11319, 2024.
  24. [24] J. Lin, C. Gan, and S. Han. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7083–7093, 2019.
  25. [25] P. Liu, F. Wang, K. Li, G. Chen, Y. Wei, S. Tang, Z. Wu, and D. Guo. Micro-gesture online recognition using learnable query points. arXiv preprint arXiv:2407.04490, 2024.
  26. [26] X. Liu, H. Shi, H. Chen, Z. Yu, X. Li, and G. Zhao. iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10631–10642, 2021.
  27. [27] Y. Liu, L. Wang, Y. Wang, X. Ma, and Y. Qiao. FineAction: A fine-grained video dataset for temporal action localization. IEEE Transactions on Image Processing, 31:6937–6950, 2022.
  28. [28] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3202–3211, 2022.
  29. [29] H. Luo, G. Lin, Y. Yao, Z. Tang, Q. Wu, and X. Hua. Dense semantics-assisted networks for video action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(5):3073–3084, 2021.
  30. [30] C.-Y. Ma, M.-H. Chen, Z. Kira, and G. AlRegib. TS-LSTM and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition. Signal Processing: Image Communication, 71:76–87, 2019.
  31. [31] C. C. Nuckols and C. C. Nuckols. The Diagnostic and Statistical Manual of Mental Disorders (DSM-5). Philadelphia: American Psychiatric Association, 2013.
  32. [32] S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  33. [33] H. Shao, S. Qian, and Y. Liu. Temporal interlacing network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11966–11973, 2020.
  34. [34] L. Shi, Y. Zhang, J. Cheng, and H. Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12026–12035, 2019.
  35. [35] L. Shi, Y. Zhang, J. Cheng, and H. Lu. Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Transactions on Image Processing, 29:9532–9545, 2020.
  36. [36] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  37. [37] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
  38. [38] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
  39. [39] L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao. VideoMAE V2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14549–14560, 2023.
  40. [40] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11):2740–2755, 2018.
  41. [41] H. Wu, X. Ma, and Y. Li. Spatiotemporal multimodal learning with 3D CNNs for video action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(3):1250–1261, 2021.
  42. [42] S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.