pith. machine review for the scientific record.

arxiv: 2605.14548 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Local Spatiotemporal Convolutional Network for Robust Gait Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords gait recognition · spatiotemporal convolution · 2D convolutional network · temporal feature extraction · biometric identification · video motion analysis · strip-based pooling

The pith

A dual-branch network endows standard 2D convolutions with the ability to extract temporal gait motion patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Local Spatiotemporal Convolutional Network to extract walking dynamics from video frames without relying on complex 3D convolutions or recurrent models. It introduces Global Bidirectional Spatial Pooling to break gait features into horizontal and vertical strips, letting the temporal dimension participate directly in ordinary 2D convolution operations. A Local Spatiotemporal Convolutional layer then learns adaptive strip-based motion patterns, with asymmetric kernels further enriching representations across separate domains. This keeps the architecture simple while addressing interference from viewpoint, clothing, and carrying variations. A sympathetic reader would care because gait offers practical non-invasive identification over distance, yet most current solutions either ignore time or demand heavy computation.
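A minimal sketch of the GBSP reduction as the summary above describes it. The tensor layout (N, C, T, H, W), the choice of mean pooling, and every name here are assumptions for illustration; the paper's exact formulation is not given in this review.

```python
import torch

def gbsp(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """x: gait feature tensor of shape (N, C, T, H, W).

    Returns two strip maps in which time T is one of the two axes that an
    ordinary 2D convolution slides over:
      horizontal strips: (N, C, T, H), width pooled away
      vertical strips:   (N, C, T, W), height pooled away
    """
    horizontal = x.mean(dim=4)  # average over width: one value per row strip
    vertical = x.mean(dim=3)    # average over height: one value per column strip
    return horizontal, vertical
```

Once one spatial axis is pooled away, a plain 3x3 kernel over the (T, H) map already mixes three consecutive frames with three adjacent strips, which is the sense in which the temporal dimension participates directly in standard 2D convolution.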

Core claim

The central claim is that the LSTCN dual-branch architecture, built on Global Bidirectional Spatial Pooling to decompose gait tensors into horizontal and vertical strip-based local representations, combined with Local Spatiotemporal Convolutional layers and asymmetric kernels, allows standard 2D convolutional networks to jointly process temporal and spatial dimensions and thereby capture intrinsic motion patterns for robust gait recognition.

What carries the argument

Global Bidirectional Spatial Pooling (GBSP) that reduces spatial dimensionality into strip-based local representations so temporal information can enter standard 2D convolutions, plus the Local Spatiotemporal Convolutional (LSTC) layer that adaptively learns motion patterns.
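Read literally, the LSTC layer is then an ordinary 2D convolution applied to a GBSP strip map. A hedged sketch under that reading, with kernel size, channel width, and activation chosen arbitrarily rather than taken from the paper:

```python
import torch
import torch.nn as nn

class LSTC(nn.Module):
    """Illustrative Local Spatiotemporal Convolutional layer.

    Input is a GBSP strip map of shape (N, C, T, S), where S counts the
    horizontal or vertical strips; each 3x3 receptive field then covers
    3 frames x 3 adjacent strips, i.e. a local spatiotemporal window.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU()

    def forward(self, strips: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv(strips))

# usage with the gbsp sketch above: h, v = gbsp(x); out = LSTC(64)(h)
```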

If this is right

  • Standard 2D convolutional networks gain the capacity to extract temporal information from consecutive gait frames.
  • The approach reduces reliance on computationally heavy sequential models such as LSTMs or 3D convolutions.
  • Asymmetric convolution kernels independently attend to temporal, spatial, and joint spatiotemporal domains to enrich features (see the sketch after this list).
  • Local strip representations help the network adaptively learn gait motion patterns under external variations.
  • The overall architecture remains structurally simple while handling the complexity of video data.
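The asymmetric-kernel bullet above maps naturally onto an ACNet-style three-branch convolution. In the sketch below the fusion by summation is an assumption; the abstract says only that the kernels attend to the three domains independently.

```python
import torch
import torch.nn as nn

class AsymmetricLSTC(nn.Module):
    """Illustrative three-branch asymmetric convolution on a strip map.

    On an (N, C, T, S) input, a (3, 1) kernel sees only time, a (1, 3)
    kernel only the strip axis, and a (3, 3) kernel the joint domain.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.temporal = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.spatial = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.joint = nn.Conv2d(channels, channels, (3, 3), padding=1)

    def forward(self, strips: torch.Tensor) -> torch.Tensor:
        # Summing keeps shapes aligned; LSTCN may instead concatenate or
        # stage the branches.
        return self.temporal(strips) + self.spatial(strips) + self.joint(strips)
```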

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The strip-based reduction technique could transfer to other video tasks where temporal modeling is needed but 3D compute is unavailable.
  • Dual-branch design may offer a template for balancing local motion cues with global appearance in related biometrics.
  • Further tests on mobile hardware would check whether the efficiency gains enable real-time gait applications.
  • The method might reveal which local spatial strips carry the most identity-discriminative temporal signals.

Load-bearing premise

That reducing gait tensors to horizontal and vertical strip-based local representations preserves enough discriminative motion information to handle covariate changes without major loss.

What would settle it

If the LSTCN model achieves lower accuracy than a standard 3D convolutional baseline on a gait benchmark containing viewpoint, clothing, and carrying variations, the claim that the strip-based reduction enables effective temporal capture would be challenged.

Original abstract

Gait recognition, as a promising biometric technology, identifies individuals through their unique walking patterns and offers distinctive advantages including non-invasiveness, long-range applicability, and resistance to deliberate disguise. Despite these merits, capturing the intrinsic motion patterns concealed within consecutive video frames remains challenging due to the complexity of video data and the interference of external covariates such as viewpoint changes, clothing variations, and carrying conditions. Existing approaches predominantly either rely on static appearance features extracted from individual silhouette frames or employ complex sequential models (e.g., LSTM, 3D convolutions) that demand substantial computational resources and sophisticated training strategies. To address these limitations, we propose a Local Spatiotemporal Convolutional Network (LSTCN), a structurally simple yet highly effective dual-branch architecture that endows standard two-dimensional convolutional networks with the capacity to extract temporal information. Specifically, we introduce a Global Bidirectional Spatial Pooling (GBSP) mechanism that reduces the dimensionality of gait tensors by decomposing spatial features into horizontal and vertical strip-based local representations, enabling the temporal dimension to participate in standard 2D convolution operations. Building upon this, we design a Local Spatiotemporal Convolutional (LSTC) layer that jointly processes temporal and spatial dimensions, allowing the network to adaptively learn strip-based gait motion patterns. We further extend this formulation with asymmetric convolution kernels that independently attend to the temporal, spatial, and joint spatiotemporal domains, thereby enriching the extracted feature representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Local Spatiotemporal Convolutional Network (LSTCN) for gait recognition, a dual-branch architecture that augments standard 2D CNNs with temporal modeling capacity. It introduces Global Bidirectional Spatial Pooling (GBSP) to decompose gait tensors into horizontal and vertical strip-based local representations, enabling the temporal dimension to participate in 2D convolutions, and designs Local Spatiotemporal Convolutional (LSTC) layers with asymmetric kernels that jointly process temporal, spatial, and joint spatiotemporal domains to learn strip-based motion patterns.

Significance. If the empirical claims hold, the work offers a structurally simple alternative to 3D convolutions or recurrent models for extracting intrinsic gait motion under covariates, with potential computational advantages. The explicit design of asymmetric kernels and the GBSP reduction mechanism are clear contributions that could be adopted in other video-based recognition tasks.

major comments (2)
  1. [GBSP mechanism and LSTC layer description] The central claim that GBSP strip pooling preserves fine-grained temporal variations (e.g., localized limb trajectories) without meaningful loss of discriminative information under clothing, bag, or viewpoint changes is load-bearing but unsupported by any analysis or ablation in the provided text. The description in the abstract and architecture section implies intra-strip averaging occurs, yet no quantitative demonstration (e.g., comparison of feature discriminability before/after pooling on covariate-altered silhouettes) is given to refute the risk that this smoothing degrades input to the LSTC layers.
  2. [Abstract and proposed method] No quantitative results, ablation studies, or error analysis appear in the abstract or architecture sections to verify that the dual-branch LSTCN actually improves recognition accuracy or robustness relative to baselines (standard 2D CNNs, 3D CNNs, or LSTMs). Without these, the assertion that the architecture 'endows standard two-dimensional convolutional networks with the capacity to extract temporal information' remains unverified.
minor comments (1)
  1. [LSTC layer] Notation for the asymmetric convolution kernels (temporal, spatial, and joint domains) should be formalized with explicit equations or a diagram showing kernel dimensions and how they are applied to the GBSP outputs.
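For concreteness, one way the requested notation could read, with every symbol invented here rather than taken from the paper:

```latex
% X \in R^{C x T x S}: a GBSP strip map with C channels, T frames, S strips.
% * denotes 2D convolution; \sigma an activation.
\[
  Y = \sigma\!\left( K_{t} * X + K_{s} * X + K_{ts} * X \right),
  \qquad
  K_{t} \in \mathbb{R}^{3 \times 1},\quad
  K_{s} \in \mathbb{R}^{1 \times 3},\quad
  K_{ts} \in \mathbb{R}^{3 \times 3}.
\]
```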

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment below and will revise the paper to incorporate additional supporting analysis and clarifications.

Point-by-point responses
  1. Referee: [GBSP mechanism and LSTC layer description] The central claim that GBSP strip pooling preserves fine-grained temporal variations (e.g., localized limb trajectories) without meaningful loss of discriminative information under clothing, bag, or viewpoint changes is load-bearing but unsupported by any analysis or ablation in the provided text. The description in the abstract and architecture section implies intra-strip averaging occurs, yet no quantitative demonstration (e.g., comparison of feature discriminability before/after pooling on covariate-altered silhouettes) is given to refute the risk that this smoothing degrades input to the LSTC layers.

    Authors: We acknowledge that the current manuscript does not include explicit quantitative analysis of the GBSP mechanism's effect on temporal feature preservation. In the revised version, we will add a dedicated ablation study in the Experiments section. This will compare feature discriminability metrics (such as class separability via Fisher discriminant ratio and reconstruction fidelity of limb trajectories) on covariate-altered gait silhouettes before and after GBSP pooling, directly addressing the potential smoothing effects and demonstrating retention of fine-grained motion patterns. revision: yes
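As a sketch of the separability metric named in this response: a Fisher discriminant ratio (between-class scatter over within-class scatter) computed on features before and after pooling. The function is illustrative, not taken from the paper or its planned revision.

```python
import torch

def fisher_ratio(features: torch.Tensor, labels: torch.Tensor) -> float:
    """features: (N, D) flattened gait features; labels: (N,) subject ids."""
    overall_mean = features.mean(dim=0)
    between = torch.zeros(())
    within = torch.zeros(())
    for c in labels.unique():
        class_feats = features[labels == c]
        class_mean = class_feats.mean(dim=0)
        between = between + len(class_feats) * (class_mean - overall_mean).pow(2).sum()
        within = within + (class_feats - class_mean).pow(2).sum()
    return (between / within).item()

# Comparing fisher_ratio(raw, ids) against fisher_ratio(pooled, ids) would
# quantify how much discriminative motion information intra-strip averaging
# removes on covariate-altered silhouettes.
```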

  2. Referee: [Abstract and proposed method] No quantitative results, ablation studies, or error analysis appear in the abstract or architecture sections to verify that the dual-branch LSTCN actually improves recognition accuracy or robustness relative to baselines (standard 2D CNNs, 3D CNNs, or LSTMs). Without these, the assertion that the architecture 'endows standard two-dimensional convolutional networks with the capacity to extract temporal information' remains unverified.

    Authors: The manuscript presents quantitative comparisons against 2D CNN, 3D CNN, and LSTM baselines, along with ablation studies on the dual-branch design and LSTC layers, in the Experiments section (including results on CASIA-B and other gait datasets under clothing, bag, and viewpoint covariates). To improve accessibility, we will revise the abstract to include a concise summary of key accuracy improvements and add a short paragraph in the architecture section referencing the empirical verification from the experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents LSTCN as an architectural proposal consisting of GBSP strip pooling and LSTC layers with asymmetric kernels. These components are introduced as independent design decisions to enable standard 2D convolutions to process temporal information from gait tensors. No equations, fitted parameters, or predictions are defined that reduce by construction to the inputs or to self-citations. The central claims rest on the empirical performance of the proposed structure rather than any definitional equivalence or load-bearing self-reference chain. The derivation is self-contained as a standard neural architecture contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the untested premise that strip pooling preserves sufficient gait dynamics for 2D convolution to learn temporal patterns; no external benchmarks or prior proofs are invoked in the abstract.

free parameters (1)
  • asymmetric convolution kernel dimensions
    Kernel sizes chosen independently for temporal, spatial, and joint domains; specific values not stated in abstract.
axioms (1)
  • domain assumption: Decomposing spatial features into horizontal and vertical strips enables temporal information to participate in standard 2D convolution without loss of motion discriminability.
    Invoked in the definition of GBSP and LSTC layers.
invented entities (2)
  • Global Bidirectional Spatial Pooling (GBSP) · no independent evidence
    purpose: Reduce dimensionality of gait tensors into strip-based local representations
    Newly introduced mechanism to enable temporal processing in 2D CNNs.
  • Local Spatiotemporal Convolutional (LSTC) layer · no independent evidence
    purpose: Jointly process temporal and spatial dimensions with adaptive strip-based patterns
    Core new layer proposed in the architecture.

pith-pipeline@v0.9.0 · 5549 in / 1290 out tokens · 54478 ms · 2026-05-15T02:03:29.160864+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
