pith. machine review for the scientific record.

arxiv: 2605.14548 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Local Spatiotemporal Convolutional Network for Robust Gait Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords gait recognition · spatiotemporal convolution · 2D convolutional network · temporal feature extraction · biometric identification · video motion analysis · strip-based pooling

The pith

A dual-branch network endows standard 2D convolutions with the ability to extract temporal gait motion patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Local Spatiotemporal Convolutional Network to extract walking dynamics from video frames without relying on complex 3D convolutions or recurrent models. It introduces Global Bidirectional Spatial Pooling to break gait features into horizontal and vertical strips, letting the temporal dimension participate directly in ordinary 2D convolution operations. A Local Spatiotemporal Convolutional layer then learns adaptive strip-based motion patterns, with asymmetric kernels further enriching representations across separate domains. This keeps the architecture simple while addressing interference from viewpoint, clothing, and carrying variations. A sympathetic reader would care because gait offers practical non-invasive identification over distance, yet most current solutions either ignore time or demand heavy computation.
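A minimal sketch of the GBSP reduction as the summary above describes it. The tensor layout (N, C, T, H, W), the choice of mean pooling, and every name here are assumptions for illustration; the paper's exact formulation is not given in this review.

```python
import torch

def gbsp(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """x: gait feature tensor of shape (N, C, T, H, W).

    Returns two strip maps in which time T is one of the two axes that an
    ordinary 2D convolution slides over:
      horizontal strips: (N, C, T, H), width pooled away
      vertical strips:   (N, C, T, W), height pooled away
    """
    horizontal = x.mean(dim=4)  # average over width: one value per row strip
    vertical = x.mean(dim=3)    # average over height: one value per column strip
    return horizontal, vertical
```

Once one spatial axis is pooled away, a plain 3x3 kernel over the (T, H) map already mixes three consecutive frames with three adjacent strips, which is the sense in which the temporal dimension participates directly in standard 2D convolution.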

Core claim

The central claim is that the LSTCN dual-branch architecture, built on Global Bidirectional Spatial Pooling to decompose gait tensors into horizontal and vertical strip-based local representations, combined with Local Spatiotemporal Convolutional layers and asymmetric kernels, allows standard 2D convolutional networks to jointly process temporal and spatial dimensions and thereby capture intrinsic motion patterns for robust gait recognition.

What carries the argument

Global Bidirectional Spatial Pooling (GBSP) that reduces spatial dimensionality into strip-based local representations so temporal information can enter standard 2D convolutions, plus the Local Spatiotemporal Convolutional (LSTC) layer that adaptively learns motion patterns.
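Read literally, the LSTC layer is then an ordinary 2D convolution applied to a GBSP strip map. A hedged sketch under that reading, with kernel size, channel width, and activation chosen arbitrarily rather than taken from the paper:

```python
import torch
import torch.nn as nn

class LSTC(nn.Module):
    """Illustrative Local Spatiotemporal Convolutional layer.

    Input is a GBSP strip map of shape (N, C, T, S), where S counts the
    horizontal or vertical strips; each 3x3 receptive field then covers
    3 frames x 3 adjacent strips, i.e. a local spatiotemporal window.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU()

    def forward(self, strips: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv(strips))

# usage with the gbsp sketch above: h, v = gbsp(x); out = LSTC(64)(h)
```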

If this is right

  • Standard 2D convolutional networks gain the capacity to extract temporal information from consecutive gait frames.
  • The approach reduces reliance on computationally heavy sequential models such as LSTMs or 3D convolutions.
  • Asymmetric convolution kernels independently attend to temporal, spatial, and joint spatiotemporal domains to enrich features (see the sketch after this list).
  • Local strip representations help the network adaptively learn gait motion patterns under external variations.
  • The overall architecture remains structurally simple while handling the complexity of video data.
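The asymmetric-kernel bullet above maps naturally onto an ACNet-style three-branch convolution. In the sketch below the fusion by summation is an assumption; the abstract says only that the kernels attend to the three domains independently.

```python
import torch
import torch.nn as nn

class AsymmetricLSTC(nn.Module):
    """Illustrative three-branch asymmetric convolution on a strip map.

    On an (N, C, T, S) input, a (3, 1) kernel sees only time, a (1, 3)
    kernel only the strip axis, and a (3, 3) kernel the joint domain.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.temporal = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.spatial = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.joint = nn.Conv2d(channels, channels, (3, 3), padding=1)

    def forward(self, strips: torch.Tensor) -> torch.Tensor:
        # Summing keeps shapes aligned; LSTCN may instead concatenate or
        # stage the branches.
        return self.temporal(strips) + self.spatial(strips) + self.joint(strips)
```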

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The strip-based reduction technique could transfer to other video tasks where temporal modeling is needed but 3D compute is unavailable.
  • Dual-branch design may offer a template for balancing local motion cues with global appearance in related biometrics.
  • Further tests on mobile hardware would check whether the efficiency gains enable real-time gait applications.
  • The method might reveal which local spatial strips carry the most identity-discriminative temporal signals.

Load-bearing premise

That reducing gait tensors to horizontal and vertical strip-based local representations preserves enough discriminative motion information to handle covariate changes without major loss.

What would settle it

If the LSTCN model achieves lower accuracy than a standard 3D convolutional baseline on a gait benchmark containing viewpoint, clothing, and carrying variations, the claim that the strip-based reduction enables effective temporal capture would be challenged.

Original abstract

Gait recognition, as a promising biometric technology, identifies individuals through their unique walking patterns and offers distinctive advantages including non-invasiveness, long-range applicability, and resistance to deliberate disguise. Despite these merits, capturing the intrinsic motion patterns concealed within consecutive video frames remains challenging due to the complexity of video data and the interference of external covariates such as viewpoint changes, clothing variations, and carrying conditions. Existing approaches predominantly either rely on static appearance features extracted from individual silhouette frames or employ complex sequential models (e.g., LSTM, 3D convolutions) that demand substantial computational resources and sophisticated training strategies. To address these limitations, we propose a Local Spatiotemporal Convolutional Network (LSTCN), a structurally simple yet highly effective dual-branch architecture that endows standard two-dimensional convolutional networks with the capacity to extract temporal information. Specifically, we introduce a Global Bidirectional Spatial Pooling (GBSP) mechanism that reduces the dimensionality of gait tensors by decomposing spatial features into horizontal and vertical strip-based local representations, enabling the temporal dimension to participate in standard 2D convolution operations. Building upon this, we design a Local Spatiotemporal Convolutional (LSTC) layer that jointly processes temporal and spatial dimensions, allowing the network to adaptively learn strip-based gait motion patterns. We further extend this formulation with asymmetric convolution kernels that independently attend to the temporal, spatial, and joint spatiotemporal domains, thereby enriching the extracted feature representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Local Spatiotemporal Convolutional Network (LSTCN) for gait recognition, a dual-branch architecture that augments standard 2D CNNs with temporal modeling capacity. It introduces Global Bidirectional Spatial Pooling (GBSP) to decompose gait tensors into horizontal and vertical strip-based local representations, enabling the temporal dimension to participate in 2D convolutions, and designs Local Spatiotemporal Convolutional (LSTC) layers with asymmetric kernels that jointly process temporal, spatial, and joint spatiotemporal domains to learn strip-based motion patterns.

Significance. If the empirical claims hold, the work offers a structurally simple alternative to 3D convolutions or recurrent models for extracting intrinsic gait motion under covariates, with potential computational advantages. The explicit design of asymmetric kernels and the GBSP reduction mechanism are clear contributions that could be adopted in other video-based recognition tasks.

major comments (2)
  1. [GBSP mechanism and LSTC layer description] The central claim that GBSP strip pooling preserves fine-grained temporal variations (e.g., localized limb trajectories) without meaningful loss of discriminative information under clothing, bag, or viewpoint changes is load-bearing but unsupported by any analysis or ablation in the provided text. The description in the abstract and architecture section implies intra-strip averaging occurs, yet no quantitative demonstration (e.g., comparison of feature discriminability before/after pooling on covariate-altered silhouettes) is given to refute the risk that this smoothing degrades input to the LSTC layers.
  2. [Abstract and proposed method] No quantitative results, ablation studies, or error analysis appear in the abstract or architecture sections to verify that the dual-branch LSTCN actually improves recognition accuracy or robustness relative to baselines (standard 2D CNNs, 3D CNNs, or LSTMs). Without these, the assertion that the architecture 'endows standard two-dimensional convolutional networks with the capacity to extract temporal information' remains unverified.
minor comments (1)
  1. [LSTC layer] Notation for the asymmetric convolution kernels (temporal, spatial, and joint domains) should be formalized with explicit equations or a diagram showing kernel dimensions and how they are applied to the GBSP outputs.
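For concreteness, one way the requested notation could read, with every symbol invented here rather than taken from the paper:

```latex
% X \in R^{C x T x S}: a GBSP strip map with C channels, T frames, S strips.
% * denotes 2D convolution; \sigma an activation.
\[
  Y = \sigma\!\left( K_{t} * X + K_{s} * X + K_{ts} * X \right),
  \qquad
  K_{t} \in \mathbb{R}^{3 \times 1},\quad
  K_{s} \in \mathbb{R}^{1 \times 3},\quad
  K_{ts} \in \mathbb{R}^{3 \times 3}.
\]
```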

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment below and will revise the paper to incorporate additional supporting analysis and clarifications.

Point-by-point responses
  1. Referee: [GBSP mechanism and LSTC layer description] The central claim that GBSP strip pooling preserves fine-grained temporal variations (e.g., localized limb trajectories) without meaningful loss of discriminative information under clothing, bag, or viewpoint changes is load-bearing but unsupported by any analysis or ablation in the provided text. The description in the abstract and architecture section implies intra-strip averaging occurs, yet no quantitative demonstration (e.g., comparison of feature discriminability before/after pooling on covariate-altered silhouettes) is given to refute the risk that this smoothing degrades input to the LSTC layers.

    Authors: We acknowledge that the current manuscript does not include explicit quantitative analysis of the GBSP mechanism's effect on temporal feature preservation. In the revised version, we will add a dedicated ablation study in the Experiments section. This will compare feature discriminability metrics (such as class separability via Fisher discriminant ratio and reconstruction fidelity of limb trajectories) on covariate-altered gait silhouettes before and after GBSP pooling, directly addressing the potential smoothing effects and demonstrating retention of fine-grained motion patterns. revision: yes
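As a sketch of the separability metric named in this response: a Fisher discriminant ratio (between-class scatter over within-class scatter) computed on features before and after pooling. The function is illustrative, not taken from the paper or its planned revision.

```python
import torch

def fisher_ratio(features: torch.Tensor, labels: torch.Tensor) -> float:
    """features: (N, D) flattened gait features; labels: (N,) subject ids."""
    overall_mean = features.mean(dim=0)
    between = torch.zeros(())
    within = torch.zeros(())
    for c in labels.unique():
        class_feats = features[labels == c]
        class_mean = class_feats.mean(dim=0)
        between = between + len(class_feats) * (class_mean - overall_mean).pow(2).sum()
        within = within + (class_feats - class_mean).pow(2).sum()
    return (between / within).item()

# Comparing fisher_ratio(raw, ids) against fisher_ratio(pooled, ids) would
# quantify how much discriminative motion information intra-strip averaging
# removes on covariate-altered silhouettes.
```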

  2. Referee: [Abstract and proposed method] No quantitative results, ablation studies, or error analysis appear in the abstract or architecture sections to verify that the dual-branch LSTCN actually improves recognition accuracy or robustness relative to baselines (standard 2D CNNs, 3D CNNs, or LSTMs). Without these, the assertion that the architecture 'endows standard two-dimensional convolutional networks with the capacity to extract temporal information' remains unverified.

    Authors: The manuscript presents quantitative comparisons against 2D CNN, 3D CNN, and LSTM baselines, along with ablation studies on the dual-branch design and LSTC layers, in the Experiments section (including results on CASIA-B and other gait datasets under clothing, bag, and viewpoint covariates). To improve accessibility, we will revise the abstract to include a concise summary of key accuracy improvements and add a short paragraph in the architecture section referencing the empirical verification from the experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents LSTCN as an architectural proposal consisting of GBSP strip pooling and LSTC layers with asymmetric kernels. These components are introduced as independent design decisions to enable standard 2D convolutions to process temporal information from gait tensors. No equations, fitted parameters, or predictions are defined that reduce by construction to the inputs or to self-citations. The central claims rest on the empirical performance of the proposed structure rather than any definitional equivalence or load-bearing self-reference chain. The derivation is self-contained as a standard neural architecture contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the untested premise that strip pooling preserves sufficient gait dynamics for 2D convolution to learn temporal patterns; no external benchmarks or prior proofs are invoked in the abstract.

free parameters (1)
  • asymmetric convolution kernel dimensions
    Kernel sizes chosen independently for temporal, spatial, and joint domains; specific values not stated in abstract.
axioms (1)
  • domain assumption: Decomposing spatial features into horizontal and vertical strips enables temporal information to participate in standard 2D convolution without loss of motion discriminability.
    Invoked in the definition of GBSP and LSTC layers.
invented entities (2)
  • Global Bidirectional Spatial Pooling (GBSP) · no independent evidence
    purpose: Reduce dimensionality of gait tensors into strip-based local representations
    Newly introduced mechanism to enable temporal processing in 2D CNNs.
  • Local Spatiotemporal Convolutional (LSTC) layer · no independent evidence
    purpose: Jointly process temporal and spatial dimensions with adaptive strip-based patterns
    Core new layer proposed in the architecture.

pith-pipeline@v0.9.0 · 5549 in / 1290 out tokens · 54478 ms · 2026-05-15T02:03:29.160864+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
