SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition
Pith reviewed 2026-05-08 06:59 UTC · model grok-4.3
The pith
A new multimodal dataset integrates video, radar, and wrist sensors to advance sign language recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper introduces the SIGMA-ASL dataset, collected with 20 participants performing 160 common ASL signs, resulting in 93,545 temporally synchronized word-level multimodal clips from RGB-D, mmWave radar, and IMUs, along with a unified sensing framework for millisecond-level alignment and standardized evaluation protocols.
What carries the argument
The unified sensing framework that synchronizes the outputs of the RGB-D camera, mmWave radar, and two wrist IMUs at millisecond precision to enable sensor fusion.
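The paper does not reproduce its synchronization code here, so the following is only a sketch of one common way to realize millisecond-level alignment: timestamp every stream on a shared clock, then match each camera frame to the nearest radar and IMU samples within a small tolerance. The stream names, sampling rates, and nearest-neighbor matching below are assumptions for illustration, not SIGMA-ASL's actual framework.

```python
import numpy as np

def nearest_within_tolerance(ref_ts, stream_ts, tol_ms=5.0):
    """For each reference timestamp (e.g. an RGB-D frame time, in ms on a
    shared clock), return the index of the nearest sample in another stream
    (radar chirp or IMU reading), or -1 if the nearest sample lies farther
    away than the tolerance."""
    ref_ts = np.asarray(ref_ts, dtype=float)
    stream_ts = np.asarray(stream_ts, dtype=float)
    pos = np.searchsorted(stream_ts, ref_ts)          # candidate insertion points
    pos = np.clip(pos, 1, len(stream_ts) - 1)
    left, right = stream_ts[pos - 1], stream_ts[pos]
    nearest = np.where(ref_ts - left <= right - ref_ts, pos - 1, pos)
    ok = np.abs(stream_ts[nearest] - ref_ts) <= tol_ms
    return np.where(ok, nearest, -1)

# Hypothetical usage: pair 30 Hz camera frames with ~200 Hz IMU samples.
camera_ts = np.arange(0.0, 1000.0, 33.3)   # one second of frames, in ms
imu_ts = np.arange(0.0, 1000.0, 5.0)       # one second of IMU readings, in ms
imu_idx_per_frame = nearest_within_tolerance(camera_ts, imu_ts)
```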
If this is right
- Multimodal models trained on the dataset can achieve higher accuracy than vision-only approaches by using complementary radio and kinematic data.
- Privacy is enhanced because radar reflections do not reveal visual details like faces.
- The dataset supports both user-dependent and user-independent evaluations to test generalization across signers.
- Standardized pipelines allow consistent benchmarking of single-modality and fused recognition systems (a split-construction sketch follows this list).
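As a concrete illustration of the last two points, here is a minimal sketch of how the two evaluation regimes are typically constructed. The clip record fields ("signer", "label"), the test fraction, and the shuffling scheme are assumptions for illustration rather than the paper's published protocol.

```python
import random
from collections import defaultdict

def user_independent_split(clips, test_signers):
    """Split word-level clips by signer identity: every clip from a held-out
    signer goes to the test set, so the model never sees test signers during
    training. `clips` is a list of dicts with at least 'signer' and 'label'."""
    train = [c for c in clips if c["signer"] not in test_signers]
    test = [c for c in clips if c["signer"] in test_signers]
    return train, test

def user_dependent_split(clips, test_fraction=0.2, seed=0):
    """Split clips within each signer: every signer contributes to both the
    train and test sets, which measures recognition for known users."""
    rng = random.Random(seed)
    by_signer = defaultdict(list)
    for c in clips:
        by_signer[c["signer"]].append(c)
    train, test = [], []
    for signer_clips in by_signer.values():
        rng.shuffle(signer_clips)
        cut = int(len(signer_clips) * (1 - test_fraction))
        train += signer_clips[:cut]
        test += signer_clips[cut:]
    return train, test
```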
Where Pith is reading between the lines
- Future work could test whether models trained on this studio data perform well in uncontrolled environments with varying lighting or backgrounds.
- The combination of sensors might enable recognition in scenarios where cameras are impractical due to privacy or lighting constraints.
- Extending the dataset to sentence-level recognition or additional sign languages could broaden its utility.
Load-bearing premise
Data collected from 20 participants in a controlled studio environment will generalize to support robust sign language recognition systems in real-world conditions.
What would settle it
A test showing that adding radar and IMU data to vision models yields no improvement in recognition accuracy on held-out real-world sign language videos.
Original abstract
Automatic sign language recognition (SLR) has become a key enabler of inclusive human-computer interaction, fostering seamless communication between deaf individuals and hearing communities. Despite significant advances in multimodal learning, existing SLR research remains dominated by vision-based datasets, which are limited by sensitivity to lighting and occlusion, privacy concerns, and a lack of cross-modal diversity. To address these challenges, we introduce SIGMA-ASL, a large-scale multimodal dataset for SLR. The dataset integrates an Azure Kinect RGB-D camera, a millimeter-wave (mmWave) radar, and two wrist-worn inertial measurement units (IMUs) to capture complementary visual, radio-reflection, and kinematic information. Collected in a controlled studio environment with 20 participants performing 160 common American sign language (ASL) signs, SIGMA-ASL provides 93,545 temporally synchronized word-level multimodal clips. A unified sensing framework achieves millisecond-level alignment across modalities, enabling reliable sensor fusion and cross-modal learning. We further design standardized preprocessing pipelines and benchmarking protocols under both user-dependent and user-independent settings, offering a comprehensive foundation for evaluating single and multimodal SLR. Extensive experiments validate the dataset's quality and demonstrate its potential as a valuable resource for developing robust, privacy-preserving, and ubiquitous sign language recognition systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SIGMA-ASL, a multimodal dataset for American Sign Language recognition that integrates Azure Kinect RGB-D, mmWave radar, and wrist-worn IMUs. It describes collection of 93,545 temporally synchronized word-level clips from 20 participants performing 160 ASL signs in a controlled studio, along with a unified sensing framework for millisecond-level alignment, standardized preprocessing pipelines, and benchmarking protocols under user-dependent and user-independent settings. The authors report extensive experiments validating dataset quality and demonstrating potential for robust, privacy-preserving SLR systems.
Significance. If the released dataset, alignment protocols, and benchmarks prove reliable, SIGMA-ASL would provide a valuable resource for multimodal SLR research by moving beyond vision-only datasets and enabling sensor-fusion studies. The standardized splits and preprocessing are positive for reproducibility, though the controlled collection limits immediate claims about real-world robustness.
Major comments (3)
- [Abstract] The central claim that SIGMA-ASL enables 'robust, privacy-preserving, and ubiquitous' SLR systems is not supported by the evidence presented. All 93,545 clips were acquired in a single controlled studio environment with only 20 signers; the manuscript reports no cross-environment testing, external-cohort validation, or evaluation under the lighting changes, occlusions, or background clutter that real-world SLR must handle.
- [Abstract, Experiments] The complementarity of the mmWave radar and IMU streams over RGB-D alone is asserted but never isolated. No ablation results show measurable accuracy or robustness gains from the additional modalities on the same user-dependent and user-independent splits, leaving the value of the multimodal setup an assumption rather than a demonstrated result.
- [Benchmarking protocols] The user-independent split risks capturing only studio-specific patterns rather than true signer generalization, because the entire corpus was collected under identical controlled conditions with no variation in environment or signer cohort.
Minor comments (2)
- [Abstract] The abstract and introduction would benefit from explicit discussion of dataset release status, licensing, and access procedures to maximize utility for the community.
- [Data collection] Clarify the exact number of repetitions per sign per participant and any quality-control steps applied during collection, so readers can assess clip diversity (a rough back-of-the-envelope estimate follows this list).
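For orientation only, and assuming repetitions were spread roughly evenly across participants and signs, the reported totals imply about 93,545 / (20 × 160) ≈ 29 clips per sign per participant; the revised manuscript should state whether the actual repetition counts were in fact this uniform.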
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment below, indicating where we will revise the manuscript to improve clarity and accuracy while maintaining the integrity of the reported work.
Point-by-point responses
- Referee: [Abstract] The central claim that SIGMA-ASL enables 'robust, privacy-preserving, and ubiquitous' SLR systems is not supported by the evidence presented. All 93,545 clips were acquired in a single controlled studio environment with only 20 signers; the manuscript reports no cross-environment testing, external-cohort validation, or evaluation under the lighting changes, occlusions, or background clutter that real-world SLR must handle.
Authors: We agree that the phrasing in the abstract overstates the immediate applicability to real-world conditions. The dataset was collected in a single controlled studio, and the paper does not include cross-environment or external validation experiments. In the revised version we will modify the abstract to describe SIGMA-ASL as a large-scale multimodal resource that provides standardized protocols for studying sensor fusion in SLR, while explicitly noting the controlled acquisition setting and the need for future work on environmental robustness. We will also add a limitations paragraph to the discussion section. revision: yes
- Referee: [Abstract, Experiments] The complementarity of the mmWave radar and IMU streams over RGB-D alone is asserted but never isolated. No ablation results show measurable accuracy or robustness gains from the additional modalities on the same user-dependent and user-independent splits, leaving the value of the multimodal setup an assumption rather than a demonstrated result.
Authors: The experiments section reports both single-modality and fused multimodal results, but we acknowledge that dedicated ablation tables isolating the incremental benefit of radar and IMU over RGB-D on identical splits were not presented. We will add these ablation experiments to the revised manuscript, evaluating RGB-D alone versus RGB-D+radar, RGB-D+IMU, and full fusion under both the user-dependent and user-independent protocols to quantify any gains (a minimal sketch of such an ablation loop appears after these responses). revision: yes
- Referee: [Benchmarking protocols] The user-independent split risks capturing only studio-specific patterns rather than true signer generalization, because the entire corpus was collected under identical controlled conditions with no variation in environment or signer cohort.
Authors: We concur that the user-independent split evaluates signer variation within the same studio environment and therefore does not capture environmental or cohort diversity. In the revision we will clarify this distinction in the benchmarking protocols section, explicitly stating that the split measures signer generalization under fixed conditions and that broader environmental variation remains an important direction for future dataset extensions. revision: partial
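To make the requested comparison concrete, the following is a minimal sketch of a late-fusion modality ablation of the kind the response describes. It is not the paper's fusion architecture: the modality names, the assumption that each single-modality model exposes per-clip class logits, and the logit-averaging fusion are illustrative choices only.

```python
import itertools
import numpy as np

MODALITIES = ["rgbd", "radar", "imu_left", "imu_right"]  # assumed names

def evaluate_subset(per_modality_logits, labels, subset):
    """Late-fusion ablation: average the class logits of the selected modalities
    and report top-1 accuracy. `per_modality_logits` maps a modality name to an
    (N, num_classes) array of per-clip scores from an already-trained
    single-modality model; `labels` is an (N,) array of class indices."""
    fused = np.mean([per_modality_logits[m] for m in subset], axis=0)
    return float((fused.argmax(axis=1) == labels).mean())

def run_ablation(per_modality_logits, labels):
    """Score RGB-D alone and every combination that adds radar and/or the wrist
    IMUs, mirroring the comparison requested by the referee."""
    results = {}
    extras = [m for m in MODALITIES if m != "rgbd"]
    for k in range(len(extras) + 1):
        for combo in itertools.combinations(extras, k):
            subset = ("rgbd",) + combo
            results[subset] = evaluate_subset(per_modality_logits, labels, subset)
    return results
```

Reporting such a table separately for the user-dependent and user-independent splits would directly address the referee's second major comment.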
Circularity Check
No circularity: dataset paper with purely descriptive contributions
Full rationale
This is a dataset introduction paper with no mathematical derivations, equations, fitted parameters, predictions, or uniqueness theorems. The central claims concern data collection protocols, sensor synchronization, and standardized benchmarks under user-dependent and user-independent splits. These are empirical descriptions of the SIGMA-ASL corpus (20 participants, 160 signs, 93,545 clips) rather than any chain that reduces a result to its own inputs by construction. No self-citations are invoked to justify load-bearing premises, and no ansatzes or renamings of known results occur. The paper is self-contained as a resource contribution; any limitations around generalization or sensor complementarity are questions of external validation, not internal circularity.