arxiv: 2605.00242 · v1 · submitted 2026-04-30 · 💻 cs.CV · cs.AI

Recognition: unknown

MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video

Xijia Wei , Yuan Fang , Kevin Chetty , Youngjun Cho , Nadia Bianchi-Berthouze

Authors on Pith no claims yet

Pith reviewed 2026-05-09 19:52 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords human pose estimationmmWave radarself-supervised learningmasked autoencodingspectrogram videospatiotemporal representationsprivacy-preserving sensing

0 comments

The pith

MAEPose shows masked autoencoding on unlabeled mmWave videos produces representations for accurate human pose estimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that pre-training a masked autoencoder on raw unlabeled mmWave spectrogram videos learns spatiotemporal representations that support accurate multi-frame pose estimation when a heatmap decoder is added. A sympathetic reader would care because mmWave radar avoids the privacy issues of cameras while previous radar methods discarded video information through extra processing steps and needed full supervision. The work tests this on multiple datasets using leave-one-person-out validation and checks behavior when bystanders appear without any retraining on such cases. If the claim is right, radar video can be used more directly for pose tasks with less labeling effort and better robustness in everyday settings.

Core claim

MAEPose is a masked autoencoding approach that operates directly on mmWave spectrogram videos. It learns spatiotemporal motion-aware generalized representations from unlabeled radar video streams and then applies a heatmap decoder to produce multi-frame pose predictions. Across three datasets the method outperforms prior baselines, and accuracy drops only modestly when bystanders are introduced in a zero-shot manner. Ablation results indicate that both the pre-training stage and the heatmap decoder contribute to performance, while Range-Doppler video input gives stronger results than Range-Azimuth or fused inputs at lower cost.

What carries the argument

Masked autoencoding pre-training on unlabeled mmWave spectrogram videos to acquire spatiotemporal representations, followed by a heatmap decoder that converts those representations into joint position predictions over multiple frames.

If this is right

Pose estimation becomes possible directly from raw radar video streams without intermediate point-cloud or image extraction steps that discard information.
Unlabeled radar video data can be used at scale to improve representations instead of requiring fully supervised training on every new setting.
The same model maintains most of its accuracy when bystanders enter the scene even though it was never trained on interference examples.
Range-Doppler video input yields better pose accuracy than Range-Azimuth or combined modalities while using fewer computational resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-supervised pre-training could be reused for other radar-based tasks such as activity recognition or fall detection without collecting new pose labels.
Real-world deployment would benefit from checking whether the multi-frame predictions remain stable over long continuous recordings rather than short clips.
Small amounts of labeled data might be sufficient to adapt the decoder for new radar hardware, potentially beating training a full model from scratch.

Load-bearing premise

The representations learned by reconstructing masked portions of unlabeled mmWave videos transfer effectively to accurate joint position prediction with the heatmap decoder across different people, datasets, and interference conditions.

What would settle it

An experiment on a new mmWave dataset collected with different hardware or in a different environment that finds MAEPose error rates equal to or higher than supervised baselines, particularly under bystander interference, would show the learned representations do not generalize as claimed.

Figures

Figures reproduced from arXiv: 2605.00242 by Kevin Chetty, Nadia Bianchi-Berthouze, Xijia Wei, Youngjun Cho, Yuan Fang.

**Figure 1.** Figure 1: Overview of the MAEPose architecture. The model processes sequences of Range-Doppler radar spectrograms through [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 1.** Figure 1: In Stage 1 (Self-Supervised Pretraining), a video-based masked reconstruction task is used for training MAEPose to learn the spatiotemporal representation without the need for human pose annotations. The task aims to train MAEPose to reconstruct the mmWave video spectrogram patches given a partially masked mmWave video. During pre-training, a video ViT (Vision Transformer) encoder extracts the spatiotemp… view at source ↗

**Figure 2.** Figure 2: The multi-sensory data collection platform and the [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Radar signal processing pipeline from raw signals [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Action-level model performances visualization across three datasets. Results are based on MPJPE where lower values [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: A visualization of pose estimation result from all [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative MAEPose reconstruction results on an [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

read the original abstract

Millimetre-wave (mmWave) radar offers a more privacy-preserving alternative to RGB-based human pose estimation. However, existing methods typically rely on pre-extracted intermediate representations such as sparse point clouds or spectrogram images, where the rich spatiotemporal information naturally present in radar video streams is discarded for model learning, while such signal processing adds system complexity. In addition, existing solutions are mainly conducted in an end-to-end supervised manner without leveraging unlabelled raw video streams to learn generalized representations. In this study, we present MAEPose, a masked autoencoding-based human pose estimation approach that operates directly on mmWave spectrogram videos. MAEPose learns spatiotemporal motion-aware generalized representations from unlabelled radar video, and leverages its heatmap decoder for multi-frame pose estimation predictions. We evaluate it across three datasets based on leave-one-person-out cross-validation with rigorous statistical testing. MAEPose consistently outperforms state-of-the-art baselines by up to 22.1% in MPJPE p<0.05, and maintains robust accuracy under zero-shot bystander interference with only a 6.5% error increase. Ablation studies confirm that both the pre-training and the heatmap decoder contribute substantially, while modality analysis indicates that leveraging Range-Doppler video as input achieves better pose estimation performance than Range-Azimuth or their fusion, with lower computational cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MAEPose shows masked autoencoding on mmWave spectrogram videos can improve pose estimation over prior radar baselines, with decent multi-dataset experiments.

read the letter

The main point is that MAEPose applies masked autoencoding pre-training directly to unlabeled mmWave spectrogram videos for human pose estimation, then uses a decoder for multi-frame predictions. This is a step beyond the usual point-cloud or supervised approaches in radar sensing. What stands out is the evaluation. They run leave-one-person-out validation across three datasets, test for statistical significance, and include ablations plus a zero-shot interference scenario. The reported gains of up to 22.1% in MPJPE with p<0.05, plus the small drop under bystander conditions, give the claims some grounding. The finding that Range-Doppler video works better than other modalities with less compute is practical. A couple of soft spots: the input still goes through spectrogram processing, so it's not raw ADC data. The generalization claims rest on the specific datasets and interference simulation; real-world variation in radar hardware or environments could change the picture. The abstract contrasts with prior work, but I'd check the related work section for how novel the exact combination really is. This paper is for researchers in privacy-preserving sensing and self-supervised methods for non-camera data. The experimental design looks solid enough that it should go to peer review, where referees can dig into the implementation details and confirm the numbers.

Referee Report

2 major / 2 minor

Summary. The manuscript presents MAEPose, a masked autoencoding self-supervised framework that pre-trains on unlabelled mmWave Range-Doppler spectrogram videos to learn spatiotemporal representations, then applies a heatmap decoder for multi-frame human pose estimation. It reports evaluation across three datasets via leave-one-person-out cross-validation, with up to 22.1% MPJPE improvement over baselines (p<0.05), only 6.5% error increase under zero-shot bystander interference, ablations confirming contributions from pre-training and the decoder, and modality analysis favoring Range-Doppler input.

Significance. If the quantitative results hold under full scrutiny, the work advances privacy-preserving pose estimation by operating directly on raw radar video streams without intermediate point clouds or supervised labels. Strengths include the self-supervised pre-training pipeline, statistical testing, ablations, and interference robustness test, which together provide grounding for claims of generalization beyond supervised radar methods.

major comments (2)

[§4 (Evaluation)] §4 (Evaluation): The leave-one-person-out protocol and cross-dataset claims are central to the generalization argument, but the manuscript must specify dataset statistics (subject counts, radar configurations, environment variations) and exact construction of the zero-shot bystander interference test to confirm independence from training distributions.
[§3.2 (Pre-training)] §3.2 (Pre-training): The masked autoencoding details (masking ratio, patch size, reconstruction objective) are load-bearing for the claim that unlabelled video yields motion-aware representations; without these, it is difficult to assess why the learned features transfer to the heatmap decoder better than alternatives.

minor comments (2)

The abstract and introduction alternate between 'mmWave spectrogram videos' and 'Range-Doppler video'; adopt consistent terminology from the first mention and define the input tensor shape explicitly.
Table reporting MPJPE results should include per-comparison standard deviations and the exact statistical test (e.g., paired t-test) used to obtain p<0.05.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We have addressed both major comments by expanding the relevant sections with the requested details, which we believe strengthens the manuscript's clarity and rigor without altering any claims or results.

read point-by-point responses

Referee: [§4 (Evaluation)] The leave-one-person-out protocol and cross-dataset claims are central to the generalization argument, but the manuscript must specify dataset statistics (subject counts, radar configurations, environment variations) and exact construction of the zero-shot bystander interference test to confirm independence from training distributions.

Authors: We agree that explicit dataset statistics and test construction details are necessary to fully support the generalization claims. In the revised manuscript, we have added a dedicated paragraph and summary table in §4 that reports subject counts (e.g., 12, 9, and 15 subjects across the three datasets), radar configurations (60 GHz carrier, 4 GHz bandwidth, 30 fps frame rate), and environment variations (indoor controlled, semi-outdoor, and cluttered real-world settings). For the zero-shot bystander interference test, we now describe its exact construction: it uses entirely new subjects and activity sequences recorded in previously unseen environments, with bystanders introduced as additional dynamic reflectors; no training or pre-training data from these subjects or scenes is used, ensuring distributional independence. revision: yes
Referee: [§3.2 (Pre-training)] The masked autoencoding details (masking ratio, patch size, reconstruction objective) are load-bearing for the claim that unlabelled video yields motion-aware representations; without these, it is difficult to assess why the learned features transfer to the heatmap decoder better than alternatives.

Authors: We appreciate this observation. While the high-level architecture was described, the specific hyperparameters were not stated with sufficient precision. In the revised §3.2 we now explicitly report a 75% masking ratio, 16×16 spatial patches extended across 4-frame temporal windows, and an MSE reconstruction objective applied only to masked tokens. These choices force the encoder to infer missing spatiotemporal motion cues from raw Range-Doppler video, which directly benefits transfer to the heatmap decoder; we have also added a brief justification linking these design decisions to the observed performance gains over alternative pre-training strategies in our ablations. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes a standard masked-autoencoding pre-training pipeline on unlabeled mmWave Range-Doppler spectrogram videos, followed by fine-tuning a heatmap decoder for multi-frame pose regression. Evaluation relies on leave-one-person-out cross-validation across three independent datasets, p<0.05 statistical testing, ablations isolating pre-training and decoder contributions, and a zero-shot bystander-interference test. These elements are externally grounded and do not reduce any claimed prediction or representation to the inputs by construction, nor do they depend on self-citation chains or definitional loops. The central claim that learned representations generalize is tested rather than presupposed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard assumptions in self-supervised learning and radar signal processing, but specific details on hyperparameters or model architecture are not provided in the abstract.

axioms (1)

domain assumption Masked autoencoding can learn useful spatiotemporal representations from radar spectrogram videos
This is the core assumption of the pre-training approach.

pith-pipeline@v0.9.0 · 5557 in / 1255 out tokens · 51074 ms · 2026-05-09T19:52:48.650959+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 4 canonical work pages · 2 internal anchors

[1]

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. 2021. Beit: Bert pre-training of image transformers.arXiv preprint arXiv:2106.08254(2021)

work page internal anchor Pith review arXiv 2021
[2]

D. K. Barton and H. R. Ward. 1969.Handbook of Radar Measurement

1969
[3]

1988.Statistical Power Analysis for the Behavioral Sciences(2nd ed.)

Jacob Cohen. 1988.Statistical Power Analysis for the Behavioral Sciences(2nd ed.). Lawrence Erlbaum Associates

1988
[4]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 4171–4186

2019
[5]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi- aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929(2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[6]

Junqiao Fan, Jianfei Yang, Yuecong Xu, and Lihua Xie. 2024. Diffusion model is a good pose estimator from 3D RF-vision. InEuropean Conference on Computer Vision. Springer, 259–276

2024
[7]

Yuan Fang, Fangzhan Shi, Xijia Wei, Qingchao Chen, Kevin Chetty, and Si- mon Julier. 2025. CubeDN: Real-Time Drone Detection in 3D Space from Dual mmWave Radar Cubes. In2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 3977–3983

2025
[8]

Yuan Fang, Fangzhan Shi, Xijia Wei, Qingchao Chen, Kevin Chetty, and Si- mon Julier. 2025. CubeDN: Real-Time Drone Detection in 3D Space from Dual mmWave Radar Cubes. In2025 IEEE International Conference on Robotics and Automation (ICRA). 3977–3983. doi:10.1109/ICRA55743.2025.11127766

work page doi:10.1109/icra55743.2025.11127766 2025
[9]

Christoph Feichtenhofer, Yanghao Li, Kaiming He, et al. 2022. Masked autoen- coders as spatiotemporal learners.Advances in neural information processing systems35 (2022), 35946–35958

2022
[10]

Milton Friedman. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance.Journal of the american statistical association 32, 200 (1937), 675–701

1937
[11]

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick
[12]

(2022), 16000–16009

Masked autoencoders are scalable vision learners. (2022), 16000–16009

2022
[13]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models.Advances in neural information processing systems33 (2020), 6840–6851

2020
[14]

Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. 2022. Masked autoencoders that listen.Advances in Neural Information Processing Systems35 (2022), 28708– 28720

2022
[15]

Iovescu and S

C. Iovescu and S. Rao. 2017. The Fundamentals of Millimeter Wave Radar Sensors. https://www.ti.com/lit/wp/spyy005a/spyy005a.pdf

2017
[16]

Niraj Prakash Kini, Shih-Po Lee, and Jenq-Neng Hwang. 2026. milliMamba: Multi- frame mmWave radar pose estimation with state-space models. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

2026
[17]

Shih-Po Lee, Niraj Prakash Kini, Wen-Hsiao Peng, Ching-Wen Ma, and Jenq-Neng Hwang. 2023. Hupr: A benchmark for human pose estimation using millimeter wave radar. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 5715–5724

2023
[18]

M Mahbubur Rahman, Dario Martelli, and Sevgi Z Gurbuz. 2023. Radar-based human skeleton estimation with CNN-LSTM network trained with limited data. In2023 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI). IEEE, 1–4

2023
[19]

Nornadiah Mohd Razali, Yap Bee Wah, et al. 2011. Power comparisons of shapiro- wilk, kolmogorov-smirnov, lilliefors and anderson-darling tests.Journal of statistical modeling and analytics2, 1 (2011), 21–33

2011
[20]

Zhiyao Sheng, Huatao Xu, Qian Zhang, and Dong Wang. 2022. Facilitating radar- based gesture recognition with self-supervised learning. In2022 19th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON). IEEE, 154–162

2022
[21]

A Soumya, C Krishna Mohan, and Linga Reddy Cenkeramaddi. 2023. Recent advances in mmWave-radar-based sensing, its applications, and machine learning techniques: A review.Sensors23, 21 (2023), 8901

2023
[22]

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. 2022. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems35 (2022), 10078–10093

2022
[23]

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. 2023. VideoMAE V2: Scaling video masked autoencoders with dual masking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14549–14560

2023
[24]

Xijia Wei, Yuan Fang, Kevin Chetty, Youngjun Cho, and Nadia Bianchi-Berthouze
[25]

InProceedings of the 2025 ACM Workshop on Access Networks with Artificial Intelligence

Vomee: A Multimodal Sensing Platform for Video, Audio, mmWave and Skeleton Data Capturing. InProceedings of the 2025 ACM Workshop on Access Networks with Artificial Intelligence. 36–40

2025
[26]

Xijia Wei, Yuan Fang, Zihan Liu, Xiyue Zhu, Jiawei Wang, Yifu Liu, Bruna Petreca, Sharon Baurley, Kevin Chetty, Youngjun Cho, et al. 2025. mmWaveTryOn: The first mmWave-RGB Dataset for Clothes Try-On Multimodal Gesture Recognition. InProceedings of the 2025 ACM Workshop on Access Networks with Artificial Intelligence. 31–35

2025
[27]

Eric W Weisstein. 2004. Bonferroni correction.https://mathworld. wolfram. com/ (2004)

2004
[28]

Robert F Woolson. 2007. Wilcoxon signed-rank test.Wiley encyclopedia of clinical trials(2007), 1–3

2007
[29]

Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. 2022. Vitpose: Simple vision transformer baselines for human pose estimation.Advances in neural information processing systems35 (2022), 38571–38584

2022
[30]

Jia Zhang, Rui Xi, Yuan He, Yimiao Sun, Xiuzhen Guo, Weiguo Wang, Xin Na, Yunhao Liu, Zhenguo Shi, and Tao Gu. 2023. A survey of mmWave-based human sensing: Technology, platforms and applications.IEEE Communications Surveys & Tutorials25, 4 (2023), 2052–2087

2023
[31]

Peijun Zhao, Chris Xiaoxuan Lu, Bing Wang, Niki Trigoni, and Andrew Markham
[32]

Cubelearn: End-to-end learning for human motion recognition from raw mmwave radar signals.IEEE Internet of Things Journal10, 12 (2023), 10236–10249

2023
[33]

Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen, and Zheng- ming Ding. 2021. 3d human pose estimation with spatial and temporal transform- ers. InProceedings of the IEEE/CVF international conference on computer vision. 11656–11665

2021
[34]

Bing Zhu, Junqiao Fan, Jianfei Yang, and Lihua Xie. 2024. ProbRadarM3F: mmWave radar based 3D human pose estimation with probability map guided multi-format feature fusion.arXiv preprint arXiv:2410.05569(2024)

work page arXiv 2024