pith. machine review for the scientific record.

arxiv: 2604.07578 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: 2 theorem links


MSGL-Transformer: A Multi-Scale Global-Local Transformer for Rodent Social Behavior Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords rodent behavior recognition · multi-scale transformer · pose-based classification · social behavior analysis · temporal attention · RatSI · CalMS21

The pith

A multi-scale global-local transformer recognizes rodent social behaviors from pose sequences more accurately than prior models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MSGL-Transformer to classify social behaviors in rodents automatically from temporal sequences of pose keypoints, replacing slow and error-prone manual scoring. The architecture uses parallel attention branches operating at short, medium, and global temporal ranges together with a modulation block that reweights features to emphasize behavior-relevant patterns. If the approach holds, it would allow faster, more consistent processing of large-scale behavioral recordings in neuroscience. The same design is shown to work on two different datasets after only minor adjustments to input size and output classes.

Core claim

MSGL-Transformer uses a lightweight transformer encoder whose multi-scale attention consists of parallel short-range, medium-range, and global branches that explicitly capture motion dynamics at different temporal scales, combined with a Behavior-Aware Modulation block that adjusts temporal embeddings to highlight behavior-relevant features before attention is applied.

What carries the argument

Multi-scale attention mechanism formed by three parallel branches (short-range, medium-range, global) plus the Behavior-Aware Modulation (BAM) block that modulates embeddings in the style of squeeze-and-excitation networks.
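To make that machinery concrete, here is a minimal PyTorch sketch of the two components. The paper does not publish code, so the embedding dimension, head count, SE reduction ratio, and especially the fusion of the three branch outputs are illustrative assumptions, not the authors' implementation; only the branch structure (causal attention over the first ⌊T/2⌋ frames, causal attention over all T frames, bidirectional attention over all T+1 tokens) follows the paper's description.

```python
import torch
import torch.nn as nn

class BAMBlock(nn.Module):
    """SE-style gate over temporal embeddings (sketch of the paper's
    Behavior-Aware Modulation; the reduction ratio of 4 is assumed)."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (B, T, D)
        s = z.mean(dim=1)                      # squeeze over the time axis
        return z * self.gate(s).unsqueeze(1)   # reweight feature channels

def causal_mask(n: int, device) -> torch.Tensor:
    # True entries are masked out: each frame may attend only to its past.
    return torch.triu(torch.ones(n, n, dtype=torch.bool, device=device), 1)

class MultiScaleAttention(nn.Module):
    """Parallel short-range (first floor(T/2) frames, causal), medium-range
    (all T frames, causal), and global (all T+1 tokens, bidirectional)
    branches. The fusion below is an illustrative guess."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.short = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.medium = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.glob = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, T+1, D), with the global token at position 0.
        frames = z[:, 1:]
        T, half = frames.size(1), frames.size(1) // 2
        s, _ = self.short(frames[:, :half], frames[:, :half], frames[:, :half],
                          attn_mask=causal_mask(half, z.device))
        m, _ = self.medium(frames, frames, frames,
                           attn_mask=causal_mask(T, z.device))
        g, _ = self.glob(z, z, z)              # bidirectional over T+1 tokens
        local = m.clone()
        local[:, :half] = 0.5 * (s + m[:, :half])   # blend overlapping frames
        return g + torch.cat([torch.zeros_like(g[:, :1]), local], dim=1)

# Toy usage: 8 windows of T=35 frames, already embedded to D=64 (assumed);
# the real model would prepend a learnable global token, not zeros.
x = torch.randn(8, 35, 64)
z = torch.cat([torch.zeros(8, 1, 64), BAMBlock(64)(x)], dim=1)
print(MultiScaleAttention(64)(z).shape)        # torch.Size([8, 36, 64])
```

The causal masks keep the short- and medium-range branches strictly past-looking, while the global branch attends bidirectionally over the extra token; that separation is what makes each temporal range's contribution observable.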

If this is right

  • Outperforms TCN, LSTM, and Bi-LSTM baselines on the RatSI dataset, reaching 75.4 percent mean accuracy across nine cross-validation splits.
  • Achieves 87.1 percent accuracy and 0.8745 F1 on CalMS21, a 10.7 percent gain over HSTWFormer while also beating ST-GCN, MS-G3D, CTR-GCN, and STGAT.
  • The identical architecture works on both five-class and four-class problems after changing only input dimensionality and number of output classes.
  • Explicit separation of attention scales makes the contribution of each temporal range directly observable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Success would make large-scale automated analysis of rodent social behavior practical for labs that currently rely on manual scoring.
  • The design could be tested on pose data from other species or on behaviors that involve more than two animals.
  • Future work could measure how much performance depends on the quality of the upstream pose tracker by injecting controlled tracking errors.

Load-bearing premise

The 12-dimensional or 28-dimensional pose keypoints supplied as input are accurate and complete representations of the animals' movements.

What would settle it

Apply the trained model to the same videos but replace the clean keypoints with versions that contain realistic tracking noise or missing joints and measure whether accuracy falls substantially below the reported 75.4 percent and 87.1 percent figures.
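A minimal sketch of that perturbation protocol, assuming coordinates normalized to [0, 1] stored as interleaved (x, y) pairs; the noise magnitudes and the previous-frame imputation are placeholders chosen here, not choices from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_keypoints(seq: np.ndarray, jitter_std: float = 0.02,
                      drop_prob: float = 0.05) -> np.ndarray:
    """Inject simple tracking artifacts into one pose window.

    seq: (T, D) keypoint array, e.g. D=12 for RatSI or D=28 for CalMS21,
    assumed normalized to [0, 1] with interleaved (x0, y0, x1, y1, ...)
    layout. Both assumptions come from this sketch, not the paper.
    """
    noisy = seq + rng.normal(0.0, jitter_std, seq.shape)  # Gaussian jitter
    # Simulate lost joints: mask a joint's x and y together, then impute
    # with the previous frame's value (a crude stand-in for a tracker gap).
    drop = rng.random((seq.shape[0], seq.shape[1] // 2)) < drop_prob
    drop = np.repeat(drop, 2, axis=1)          # mask x and y of each joint
    for t in range(1, seq.shape[0]):
        noisy[t, drop[t]] = noisy[t - 1, drop[t]]
    return np.clip(noisy, 0.0, 1.0)

# Sweep severity and compare against the clean-input 75.4% / 87.1% figures;
# `evaluate` is a hypothetical stand-in for the trained model's test loop.
# for std in (0.0, 0.01, 0.02, 0.05):
#     print(std, evaluate(perturb_keypoints(window, jitter_std=std)))
```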

Figures

Figures reproduced from arXiv: 2604.07578 by Doina Caragea, Muhammad Imran Sharif.

Figure 1. Four consecutive frames for each of the five behaviors in the RatSI dataset, showing how each behavior develops over time. Behavior instance durations vary across videos, with Solitary and Following typically spanning longer periods and Moving Away occurring over shorter spans.

Figure 2. Sample frames from the RatSI dataset showing the five social behaviors studied in this work: Solitary, Approaching, Following, Moving Away, and Social Nose Contact, spanning from no social engagement to direct physical contact. Rare behaviors such as Pinning and Nape Attacking were excluded because they occur infrequently and introduce severe class imbalance.

Figure 3. Representative video frames from the CalMS21 dataset illustrating the four annotated behavior classes. The resident (black) mouse initiates all active behaviors toward the intruder (white) mouse in the resident-intruder assay.

Figure 4. Overview of the proposed MSGL-Transformer architecture. The model combines global token generation, behavior-aware modulation, and multi-scale attention within a lightweight transformer encoder to jointly capture global context and fine-grained temporal dynamics in rodent social interactions. The concatenated sequence of the global token and the embedded pose vectors is combined with a learnable positional embedding.

Figure 5. Structure of the multi-scale attention module. The modulated input Ẑ₀ is passed through three parallel branches: short-range causal attention on the first ⌊T/2⌋ frames (focusing on rapid motion cues within short temporal contexts), medium-range causal attention on all T frames, and global bidirectional attention on all T+1 tokens. The short-range and medium-range outputs are averaged.

Figure 6. Accuracy, precision, recall, and F1-score across three representative splits in RatSI: Valid-7–Test-3 (best-performing), Valid-2–Test-8 (near the mean), and Valid-3–Test-9 (most challenging). The F1-score of the Social Nose Contact class varies noticeably across these splits, from 0.624 in Valid-2–Test-8 down to 0.348 in Valid-7–Test-3.

Figure 7. ROC curves across the three representative splits of the RatSI dataset.

Figure 8. Confusion matrices obtained for the Valid-2–Test-8, Valid-3–Test-9, and Valid-7–Test-3 splits.

Figure 9. ROC curves of the MSGL-Transformer on the CalMS21 test set, showing per-class discrimination performance.

Figure 10. Confusion matrix of the MSGL-Transformer on the CalMS21 test set. The model performs best on the dominant Other class, while Attack is the hardest class, reflecting the severe class imbalance in the dataset.

Figure 11. Boundary error analysis results on the CalMS21 dataset. The left plot shows how prediction accuracy varies with distance from the nearest behavior transition; the right plot shows the distribution of samples at each distance.
Original abstract

Recognition of rodent behavior is important for understanding neural and behavioral mechanisms. Traditional manual scoring is time-consuming and prone to human error. We propose MSGL-Transformer, a Multi-Scale Global-Local Transformer for recognizing rodent social behaviors from pose-based temporal sequences. The model employs a lightweight transformer encoder with multi-scale attention to capture motion dynamics across different temporal scales. The architecture integrates parallel short-range, medium-range, and global attention branches to explicitly capture behavior dynamics at multiple temporal scales. We also introduce a Behavior-Aware Modulation (BAM) block, inspired by SE-Networks, which modulates temporal embeddings to emphasize behavior-relevant features prior to attention. We evaluate on two datasets: RatSI (5 behavior classes, 12D pose inputs) and CalMS21 (4 behavior classes, 28D pose inputs). On RatSI, MSGL-Transformer achieves 75.4% mean accuracy and F1-score of 0.745 across nine cross-validation splits, outperforming TCN, LSTM, and Bi-LSTM. On CalMS21, it achieves 87.1% accuracy and F1-score of 0.8745, a +10.7% improvement over HSTWFormer, and outperforms ST-GCN, MS-G3D, CTR-GCN, and STGAT. The same architecture generalizes across both datasets with only input dimensionality and number of classes adjusted.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents MSGL-Transformer, a Multi-Scale Global-Local Transformer for rodent social behavior recognition from pose sequences. It features a lightweight transformer with parallel short-, medium-, and global-range attention branches, along with a Behavior-Aware Modulation (BAM) block to emphasize relevant features. Evaluations on the RatSI (5 classes, 12D pose) and CalMS21 (4 classes, 28D pose) datasets report mean accuracies of 75.4% and 87.1%, respectively, with F1 scores of 0.745 and 0.8745, outperforming baselines including TCN, LSTM, Bi-LSTM, HSTWFormer, ST-GCN, MS-G3D, CTR-GCN, and STGAT. The architecture is claimed to generalize across datasets with only adjustments to input dimensions and class counts.

Significance. If the performance claims are substantiated through additional verification, the work offers a promising direction for automated analysis of rodent social behaviors, which is valuable for neuroscience and behavioral studies. The multi-scale attention mechanism addresses the temporal variability in behaviors, and the consistent architecture across two datasets demonstrates potential for broader applicability. The use of public datasets and direct comparisons to published baselines is a strength.

major comments (3)
  1. [Experimental Evaluation] The reported mean accuracy of 75.4% on RatSI across nine cross-validation splits and 87.1% on CalMS21 lack error bars, standard deviations, or statistical significance tests. This makes it challenging to determine whether the improvements over baselines such as TCN, LSTM, and HSTWFormer are statistically meaningful.
  2. [Method and Experiments] No ablation studies are provided for the multi-scale attention branches (short-range, medium-range, global) or the Behavior-Aware Modulation (BAM) block. Since these are the core innovations, their individual contributions to the reported F1 scores (0.745 on RatSI, 0.8745 on CalMS21) cannot be verified, undermining the central architectural claims.
  3. [Introduction and Evaluation] The model relies on 12D and 28D pose keypoints without any experiments testing sensitivity to tracking errors, occlusions, or missing joints. Given that rodent pose estimation is prone to such issues and the architecture lacks explicit noise-handling mechanisms, the generalization claim across datasets may not hold in practical, noisy conditions.
minor comments (2)
  1. [Abstract] The abstract states 'nine cross-validation splits' for RatSI but does not specify the split strategy or dataset size, which would aid reproducibility.
  2. [Method] Additional details on the exact integration of the BAM block with the attention branches would improve clarity of the method description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and will incorporate revisions where they strengthen the work.

Point-by-point responses
  1. Referee: The reported mean accuracy of 75.4% on RatSI across nine cross-validation splits and 87.1% on CalMS21 lack error bars, standard deviations, or statistical significance tests. This makes it challenging to determine whether the improvements over baselines such as TCN, LSTM, and HSTWFormer are statistically meaningful.

    Authors: We agree that reporting variability and statistical tests would strengthen the evaluation. The reported means are averages over nine cross-validation splits on RatSI, but standard deviations and error bars were omitted. In the revised manuscript we will add standard deviations for all metrics, include error bars in the tables, and run paired statistical tests (e.g., Wilcoxon signed-rank) against the baselines to confirm the significance of the reported gains; a sketch of such a test appears after these responses. revision: yes

  2. Referee: No ablation studies are provided for the multi-scale attention branches (short-range, medium-range, global) or the Behavior-Aware Modulation (BAM) block. Since these are the core innovations, their individual contributions to the reported F1 scores (0.745 on RatSI, 0.8745 on CalMS21) cannot be verified, undermining the central architectural claims.

    Authors: We acknowledge that the absence of ablation studies leaves the contribution of each proposed component unquantified. The original submission presented the full model and its overall results but did not isolate the branches or the BAM block. We will conduct the necessary ablation experiments for the revised version, reporting accuracy and F1 scores on both datasets when each attention branch and the BAM block are removed individually; a sketch of such a harness also appears after these responses. revision: yes

  3. Referee: The model relies on 12D and 28D pose keypoints without any experiments testing sensitivity to tracking errors, occlusions, or missing joints. Given that rodent pose estimation is prone to such issues and the architecture lacks explicit noise-handling mechanisms, the generalization claim across datasets may not hold in practical, noisy conditions.

    Authors: This is a fair observation about real-world robustness. Although the architecture generalizes across the two datasets with different input dimensions, we did not evaluate performance under simulated pose-estimation artifacts. In the revision we will add controlled experiments that introduce missing joints, occlusions, and additive noise to the keypoint sequences and report the resulting performance drops on both datasets, together with a short discussion of possible noise-robust extensions. revision: yes
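The paired test promised in response 1 is mechanically simple once per-split scores exist. A sketch with placeholder numbers, since the paper reports only the 75.4% mean rather than the nine per-split accuracies; these values are illustrative, not results:

```python
from scipy.stats import wilcoxon

# Placeholder per-split accuracies over the nine RatSI cross-validation
# splits -- illustrative values only, NOT results from the paper.
msgl   = [0.78, 0.74, 0.71, 0.77, 0.76, 0.73, 0.79, 0.75, 0.76]
bilstm = [0.72, 0.70, 0.69, 0.73, 0.71, 0.70, 0.74, 0.71, 0.72]

# Paired, non-parametric comparison across matched splits.
stat, p = wilcoxon(msgl, bilstm)
print(f"Wilcoxon W={stat:.1f}, p={p:.4f}")
```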
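Response 2's leave-one-out protocol can likewise be pinned down. In this sketch, `build_model` and `train_and_eval` are hypothetical stand-ins for the authors' unreleased pipeline, and the component flags are assumed constructor arguments:

```python
COMPONENTS = ("short_branch", "medium_branch", "global_branch", "bam")

def run_ablations(build_model, train_and_eval, dataset):
    """Run the full model plus one training run per removed component;
    returns metrics keyed by the ablated part, e.g. {"full": ..., "bam": ...}."""
    results = {}
    for removed in (None,) + COMPONENTS:
        flags = {c: (c != removed) for c in COMPONENTS}  # disable one component
        results[removed or "full"] = train_and_eval(build_model(**flags), dataset)
    return results
```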

Circularity Check

0 steps flagged

No circularity: empirical ML evaluation on public datasets with independent baselines

full rationale

The paper describes a transformer architecture (multi-scale attention branches plus the BAM block) and reports accuracy/F1 scores obtained by training and evaluating on fixed public datasets (RatSI, CalMS21) under standard cross-validation. Nothing in the setup reduces the reported results by construction to fitted constants, self-citations, or redefinitions of the inputs. The architecture choices are presented as design decisions, not derived from prior self-work that would force the outcomes. Performance gains are measured against published external baselines, making the central claims self-contained and falsifiable outside any internal loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that standard transformer attention can be directly applied to low-dimensional pose sequences without additional inductive biases for skeletal structure; no free parameters are explicitly fitted beyond normal training, and no new entities are postulated.

axioms (1)
  • domain assumption: Pose keypoints provide a sufficient and low-noise representation of rodent social behavior.
    The model takes 12D or 28D pose inputs as given; any tracking error would directly degrade the reported accuracies.

pith-pipeline@v0.9.0 · 5554 in / 1318 out tokens · 18875 ms · 2026-05-10T17:50:53.486356+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
