Motion Capture from Inertial and Vision Sensors

Qian Bao; Ruoli Dai; Tao Mei; Wu Liu; Xiaodong Chen; Xinchen Liu; Yongdong Zhang

arxiv: 2407.16341 · v4 · submitted 2024-07-23 · 💻 cs.CV

Xiaodong Chen, Wu Liu, Qian Bao, Xinchen Liu, Ruoli Dai, Yongdong Zhang, Tao Mei

Pith reviewed 2026-05-23 22:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords: motion capture, IMU, monocular video, human pose estimation, multi-modal fusion, SMPL parameters, consumer hardware

The pith

A monocular camera plus a few IMUs can capture human motion accurately enough for daily use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a large dataset of synchronized IMU readings and video frames that records 146 fine-grained actions across 400 minutes. It then introduces a network that fuses the two sensor streams to recover joint positions, rotations, and body shape parameters. The work matters because industrial motion capture still relies on dozens of cameras or many sensors, while consumer devices already carry one camera and can add cheap IMUs. If the fusion works, everyday phones or wearables could replace studio rigs for animation, fitness, or rehabilitation.

Core claim

The authors claim that inertial signals and monocular video supply complementary information that together suffice for accurate multi-modal human motion capture in daily life. They support the claim by contributing the MINIONS dataset and training SparseNet on it.

What carries the argument

SparseNet, a fusion network that learns to combine sparse IMU measurements with RGB video features to output SMPL parameters and joint angles.
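To make the stated output concrete, here is a minimal sketch of a per-frame container for SMPL parameters. It assumes the standard SMPL layout (24 joints in axis-angle form, giving 72 pose values, plus 10 shape coefficients). The class and field names are hypothetical, and nothing here states which parameterization SparseNet actually emits.

```python
from dataclasses import dataclass
from typing import List

NUM_JOINTS = 24   # standard SMPL: root + 23 body joints (assumed here)
NUM_SHAPE = 10    # standard SMPL shape coefficients (assumed here)


@dataclass
class SmplFrame:
    """Hypothetical per-frame output record; not the paper's data format."""
    pose: List[float]          # axis-angle, 3 values per joint
    shape: List[float]         # body shape coefficients
    translation: List[float]   # root position in metres

    def __post_init__(self):
        # Reject records whose sizes do not match the assumed SMPL layout.
        if len(self.pose) != NUM_JOINTS * 3:
            raise ValueError("pose must hold 3 values per joint")
        if len(self.shape) != NUM_SHAPE:
            raise ValueError("unexpected number of shape coefficients")
        if len(self.translation) != 3:
            raise ValueError("translation must be a 3-vector")

    def joint_rotation(self, index: int) -> List[float]:
        """Return the axis-angle vector of one joint."""
        return self.pose[3 * index: 3 * index + 3]


# A rest-pose frame: all rotations, shape offsets and translation are zero.
rest = SmplFrame(pose=[0.0] * 72, shape=[0.0] * 10, translation=[0.0, 0.0, 0.0])
```

Validating sizes at construction time keeps malformed network outputs from propagating silently into downstream rendering or evaluation code.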

If this is right

Motion capture becomes feasible with hardware already present in consumer phones and watches.
The released dataset supplies training data for other multi-modal pose estimators.
Sparse fusion reduces the sensor count needed for acceptable accuracy in interactive applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sensor mix could be tested on longer, continuous recordings to check drift accumulation.
Results may transfer to single-person tracking on mobile devices if the network is quantized.
Interactive actions in the dataset suggest possible extensions to two-person collaboration or sports analysis.

Load-bearing premise

The combination of one camera and very few IMUs is enough to produce accurate motion estimates outside controlled studio conditions.

What would settle it

Record the same actions with the proposed sensor mix and with a full optical marker system; if the average joint-position error stays above 5 cm or the rotation error above 10 degrees across diverse daily actions, the claim fails.
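The test above can be sketched in code. The following is a minimal illustration under stated assumptions: joints are 3D points in metres, rotations are unit quaternions in (w, x, y, z) order, and the optical marker system supplies the reference values. Only the two thresholds come from the text; the function names are hypothetical and this is not the paper's evaluation protocol.

```python
import math

# Thresholds taken from the text above; everything else is illustrative.
MAX_POSITION_ERROR_M = 0.05      # 5 cm
MAX_ROTATION_ERROR_DEG = 10.0    # 10 degrees


def mean_joint_position_error(pred, ref):
    """Mean Euclidean distance (metres) between matching 3D joints."""
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(ref)


def rotation_error_deg(q_pred, q_ref):
    """Geodesic angle (degrees) between two unit quaternions (w, x, y, z)."""
    # abs() treats q and -q as the same rotation; min() guards rounding above 1.
    dot = abs(sum(a * b for a, b in zip(q_pred, q_ref)))
    return math.degrees(2.0 * math.acos(min(1.0, dot)))


def claim_survives(pos_errors_m, rot_errors_deg):
    """True if both averaged errors stay within the stated thresholds."""
    mean_pos = sum(pos_errors_m) / len(pos_errors_m)
    mean_rot = sum(rot_errors_deg) / len(rot_errors_deg)
    return mean_pos <= MAX_POSITION_ERROR_M and mean_rot <= MAX_ROTATION_ERROR_DEG
```

In use, one would compute a position and rotation error per recorded action, then pass the two lists to `claim_survives`; a single `False` across diverse daily actions is the falsifying outcome described above.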

Figures

Figures reproduced from arXiv: 2407.16341 by Qian Bao, Ruoli Dai, Tao Mei, Wu Liu, Xiaodong Chen, Xinchen Liu, Yongdong Zhang.

**Figure 1.** Overview of our MINIONS dataset. It is collected by multiple types of sensors including eight 2K-resolution RGB cameras, Inertial Measurement Units (IMUs), and an RGB-D scanner. With the multi-modal data, we annotate human motion sequences with (d) 2D/3D joints, (e) the SMPL parameters, (f) the texture of each actor from a scanner, and fine-grained action types with textual descriptions.

**Figure 3.** Overview of dataset construction. (a) Textured mesh reconstruction with an RGB-D scanner; (b) 3D joints triangulation and tracking from multi-view videos; (c) Human pose from full-body IMUs data; and (d) Motion recovery from inertial and visual results.

**Figure 4.** Example frame of motion recovery with inertial and visual data.

**Figure 5.** Fine-grained Actions. MINIONS contains 121 single-player actions and 25 multi-player actions including common person-person and person-object interactive actions in daily life.

**Figure 6.** Qualitative results from single-subject motion capture data collection.

**Figure 7.** Qualitative results from multi-subjects motion capture data collection.

**Figure 8.** Visualization comparisons among the vision-based, IMUs-based, and multi-modal human motion capture. The vertexes are colored by the distances to the ground truth positions.

**Figure 9.** Visualization. (a): Average angular error (in degree) over sequences. (b): Average translation error (in mm) over sequences.

Original abstract

Human motion capture is the foundation for many computer vision and graphics tasks. While industrial motion capture systems with complex camera arrays or expensive wearable sensors have been widely adopted in movie and game production, consumer-affordable and easy-to-use solutions for personal applications are still far from mature. To utilize a mixture of a monocular camera and very few inertial measurement units (IMUs) for accurate multi-modal human motion capture in daily life, we contribute MINIONS in this paper, a large-scale Motion capture dataset collected from INertial and visION Sensors. MINIONS has several featured properties: 1) large scale of over five million frames and 400 minutes duration; 2) multi-modality data of IMUs signals and RGB videos labeled with joint positions, joint rotations, SMPL parameters, etc.; 3) a diverse set of 146 fine-grained single and interactive actions with textual descriptions. With the proposed MINIONS dataset, we propose a SparseNet framework to capture human motion from IMUs and videos by discovering their supplementary features and exploring the possibilities of consumer-affordable motion capture using a monocular camera and very few IMUs. The experiment results emphasize the unique advantages of inertial and vision sensors, showcasing the promise of consumer-affordable multi-modal motion capture and providing a valuable resource for further research and development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Introduces a sizable new multi-modal motion capture dataset (MINIONS) plus a SparseNet sketch, but the abstract gives no methods or results so the actual performance claims stay uncheckable.

The paper's core contribution is the MINIONS dataset: over five million frames, 400 minutes, IMU signals paired with monocular RGB video, labeled with joint positions, rotations, and SMPL parameters, plus 146 fine-grained actions. They pair it with a SparseNet idea that tries to exploit complementary cues from the two sensor types for low-cost capture. That scale and the explicit multi-modality focus are the genuinely new pieces; prior IMU+vision work has been smaller or less diverse in actions. If the data collection protocol holds up and the labels are clean, this could become a practical training resource for graphics and daily-life tracking tasks. The abstract does not overclaim beyond that.

The obvious limitation is that we have only the abstract. No architecture details for SparseNet, no sensor placement description, no collection protocol, no metrics, and no numbers appear, so it is impossible to tell whether the supplementary-feature claim or the consumer-affordable accuracy premise actually holds. That leaves the soundness assessment at the level of an unverified promise. The citation pattern looks standard for the subfield and does not raise red flags on its own.

This is aimed at researchers who need large paired IMU-video corpora for fusion or reconstruction work; anyone already running motion-capture experiments might find the resource worth checking once released. It is the kind of dataset paper that warrants a full review rather than a desk reject, provided the authors supply the missing experimental section and make the data available.
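A quick arithmetic check on the headline figures is possible even from the abstract and captions alone. The sketch below uses only the quoted numbers (five million frames as a lower bound, 400 minutes, 121 plus 25 actions, eight cameras from the Figure 1 caption). Dividing the frame count across camera views is an assumption; the abstract does not say how frames are counted.

```python
# Figures quoted in the abstract and in the Figure 1 and Figure 5 captions.
TOTAL_FRAMES = 5_000_000       # "over five million frames" (lower bound)
DURATION_MIN = 400
SINGLE_ACTIONS, MULTI_ACTIONS = 121, 25
NUM_CAMERAS = 8                # "eight 2K-resolution RGB cameras"


def aggregate_fps(frames, minutes):
    """Frames per second of recorded time, summed over all streams."""
    return frames / (minutes * 60)


fps = aggregate_fps(TOTAL_FRAMES, DURATION_MIN)   # roughly 208 frames per second

# Assumption: the total sums frames over synchronized camera views.
fps_per_view = fps / NUM_CAMERAS                  # roughly 26 frames per second

# The action breakdown in the Figure 5 caption should match the abstract's 146.
total_actions = SINGLE_ACTIONS + MULTI_ACTIONS
```

An aggregate rate near 208 frames per second is too high for one ordinary video stream, which suggests the total counts several views; under the eight-camera assumption it works out to about 26 frames per second per view. The action counts are internally consistent at 146.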

Referee Report

1 major / 0 minor

Summary. The paper claims to contribute the MINIONS dataset—a large-scale collection of over five million frames (400 minutes) of synchronized IMU signals and monocular RGB videos, annotated with joint positions, joint rotations, and SMPL parameters across 146 fine-grained single and interactive actions—and the SparseNet framework that fuses inertial and visual data to enable accurate human motion capture using only a monocular camera and very few IMUs for consumer-affordable daily-life applications.

Significance. If validated, the MINIONS dataset would provide a substantial public resource for multi-modal motion capture research due to its scale, action diversity, and textual descriptions, while SparseNet could demonstrate practical sensor fusion for sparse setups, addressing the gap between industrial systems and accessible consumer solutions in graphics, vision, and AR/VR applications.

major comments (1)

[Abstract] The central claims, that MINIONS enables consumer-affordable capture and that SparseNet discovers supplementary IMU-video features for accurate reconstruction, are presented without any description of the network architecture, sensor placement protocol, data collection procedure, loss functions, evaluation metrics, baselines, or quantitative results, making it impossible to assess whether the data or framework support the stated claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and feedback. We address the single major comment below.

Referee: [Abstract] The central claims, that MINIONS enables consumer-affordable capture and that SparseNet discovers supplementary IMU-video features for accurate reconstruction, are presented without any description of the network architecture, sensor placement protocol, data collection procedure, loss functions, evaluation metrics, baselines, or quantitative results, making it impossible to assess whether the data or framework support the stated claims.

Authors: We agree that the provided abstract is a high-level summary and does not contain the requested technical details. Abstracts are designed to be concise overviews; the full manuscript contains dedicated sections describing the SparseNet architecture, sensor placement, data collection protocol for the MINIONS dataset, loss functions, evaluation metrics, baselines, and quantitative results that support the claims. The referee summary already references these elements from the paper, indicating the full text was available for review.

Revision: no

Circularity Check

0 steps flagged

No significant circularity detected

The provided document consists only of an abstract with no equations, derivations, parameter fittings, or technical claims about how results are obtained from inputs. The work introduces a new dataset (MINIONS) and framework (SparseNet) as an empirical contribution for multi-modal motion capture, without presenting any derivation chain that could reduce to self-definition, fitted inputs renamed as predictions, or self-citation load-bearing steps. No load-bearing assumptions are isolated or shown to be circular by the paper's own text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no equations, parameters, or modeling choices are described, so no free parameters, axioms, or invented entities can be identified.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages

[1] Perception neuron. http://www.neuronmocap.com (2024)

[2] Vicon blade. http://www.vicon.com (2024)

[3] Xbox. https://support.xbox.com (2024)

[4] Ximea. https://www.ximea.com/en/products/usb-31-gen-1-with-sony-cmos-xic/mc023cg-sy (2024)

[5] Aha, D.: Lazy Learning. Springer (1997)

[6] Alldieck, T., Xu, H., Sminchisescu, C.: imghum: Implicit generative models of 3d human shape and articulated pose. In: ICCV. pp. 5441–5450 (2021)

[7] Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: SCAPE: shape completion and animation of people. ACM TOG 24(3), 408–416 (2005)

[8] Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: ICIP. pp. 3464–3468 (2016)

[9] Black, M.J., Patel, P., Tesch, J., Yang, J.: Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In: CVPR. pp. 8726–8737 (2023)

[10] Bogo, F., Kanazawa, A., Lassner, C., Gehler, P.V., Romero, J., Black, M.J.: Keep it SMPL: automatic estimation of 3d human pose and shape from a single image. In: ECCV. pp. 561–578 (2016)

[11] Cao, Z., Hidalgo, G., Simon, T., Wei, S., Sheikh, Y.: Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE TPAMI 43(1), 172–186 (2021)

[12] Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: CVPR. pp. 4724–4733 (2017)

[13] Chatzitofis, A., Saroglou, L., Boutis, P., Drakoulis, P., Zioulis, N., Subramanyam, S., Kevelham, B., Charbonnier, C., Cesar, P., Zarpalas, D., et al.: Human4d: A human-centric multimodal dataset for motions and immersive media. IEEE Access 8, 176241–176262 (2020)

[14] Chen, D., Song, Y., Liang, F., Ma, T., Zhu, X., Jia, T.: 3d human body reconstruction based on smpl model. The Visual Computer 39(5), 1893–1906 (2023)

[15] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: ICCV. pp. 6201–6210 (2019)

[16] Gilbert, A., Trumble, M., Malleson, C., Hilton, A., Collomosse, J.P.: Fusing visual and inertial sensors with semantics for 3d human pose estimation. IJCV 127(4), 381–397 (2019)

[17] Guan, P., Weiss, A., Balan, A.O., Black, M.J.: Estimating human shape and pose from a single image. In: ICCV. pp. 1381–1388 (2009)

[18] Henschel, R., von Marcard, T., Rosenhahn, B.: Accurate long-term multiple people tracking using video and body-worn imus. IEEE TIP 29, 8476–8489 (2020)

[19] Huang, C.H.P., Yi, H., Höschle, M., Safroshkin, M., Alexiadis, T., Polikovsky, S., Scharstein, D., Black, M.J.: Capturing and inferring dense full-body human-scene contact. In: CVPR. pp. 13274–13285 (2022)

[20] Huang, Y., Kaufmann, M., Aksan, E., Black, M.J., Hilliges, O., Pons-Moll, G.: Deep inertial poser: learning to reconstruct human pose from sparse inertial measurements in real time. ACM TOG 37(6), 185 (2018)

[21] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE TPAMI 36(7), 1325–1339 (2014)

[22] Jiang, Y., Ye, Y., Gopinath, D., Won, J., Winkler, A.W., Liu, C.K.: Transformer inertial poser: Real-time human motion reconstruction from sparse imus with simultaneous terrain generation. In: SIGGRAPH Asia. pp. 3:1–3:9. ACM (2022)

[23] Jin, S., Xu, L., Xu, J., Wang, C., Liu, W., Qian, C., Ouyang, W., Luo, P.: Whole-body human pose estimation in the wild. In: ECCV. pp. 196–214 (2020)

[24] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR. pp. 7122–7131 (2018)

[25] Kocabas, M., Huang, C.P., Hilliges, O., Black, M.J.: PARE: part attention regressor for 3d human body estimation. In: ICCV. pp. 11107–11117. IEEE (2021)

[26] Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Wang, L., Qiao, Y.: Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer. ICCV (2023)

[27] Liang, H., He, Y., Zhao, C., Li, M., Wang, J., Yu, J., Xu, L.: Hybridcap: Inertia-aid monocular capture of challenging human motions. In: AAAI. pp. 1539–1548. AAAI Press (2023)

[28] Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: ICCV. pp. 7082–7092 (2019)

[29] Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L., Kot, A.C.: NTU RGB+D 120: A large-scale benchmark for 3d human activity understanding. IEEE TPAMI 42(10), 2684–2701 (2020)

[30] Liu, W., Bao, Q., Sun, Y., Mei, T.: Recent advances of monocular 2d and 3d human pose estimation: a deep learning perspective. ACM Comput. Surv. 55(4), 1–41 (2022)

[31] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM TOG 34(6), 248:1–248:16 (2015)

[32] Malleson, C., Collomosse, J.P., Hilton, A.: Real-time multi-person motion capture from multi-view video and imus. IJCV 128(6), 1594–1611 (2020)

[33] von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3d human pose in the wild using imus and a moving camera. In: ECCV. pp. 614–631 (2018)

[34] von Marcard, T., Pons-Moll, G., Rosenhahn, B.: Human pose estimation from video and imus. IEEE TPAMI 38(8), 1533–1547 (2016)

[35] von Marcard, T., Rosenhahn, B., Black, M.J., Pons-Moll, G.: Sparse inertial poser: Automatic 3d human pose estimation from sparse imus. CGF 36(2), 349–360 (2017)

[36] Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3d human pose estimation. In: ICCV. pp. 2659–2668 (2017)

[37] Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3d human pose estimation in the wild using improved CNN supervision. In: IEEE 3DV. pp. 506–516 (2017)

[38] Mehta, D., Sotnychenko, O., Mueller, F., Xu, W., Sridhar, S., Pons-Moll, G., Theobalt, C.: Single-shot multi-person 3d pose estimation from monocular RGB. In: IEEE 3DV. pp. 120–130 (2018)

[39] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: CVPR. pp. 10975–10985 (2019)

[40] Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3d human pose estimation in video with temporal convolutions and semi-supervised training. In: CVPR. pp. 7753–7762 (2019)

[41] Riaz, Q., Tao, G., Krüger, B., Weber, A.: Motion reconstruction using very few accelerometers and ground contacts. Graph. Model. 79, 23–38 (2015)

[42] Schepers, M., Giuberti, M., Bellusci, G., et al.: Xsens mvn: Consistent tracking of human motion using inertial sensing. Xsens Technol 1(8) (2018)

[43] Shahroudy, A., Liu, J., Ng, T., Wang, G.: NTU RGB+D: A large scale dataset for 3d human activity analysis. In: CVPR. pp. 1010–1019 (2016)

[44] Shin, S., Li, Z., Halilaj, E.: Markerless motion tracking with noisy video and IMU data. IEEE Trans. Biomed. Eng. 70(11), 3082–3092 (2023)

[45] Sigal, L., Balan, A.O., Black, M.J.: Combined discriminative and generative articulated pose and non-rigid shape estimation. In: NIPS. pp. 1337–1344 (2007)

[46] Sun, Y., Bao, Q., Liu, W., Fu, Y., Black, M.J., Mei, T.: Monocular, one-stage, regression of multiple 3d people. In: ICCV. pp. 11159–11168 (2021)

[47] Tautges, J., Zinke, A., Krüger, B., Baumann, J., Weber, A., Helten, T., Müller, M., Seidel, H., Eberhardt, B.: Motion reconstruction using sparse accelerometer data. ACM TOG 30(3), 18:1–18:12 (2011)

[48] Tran, D., Wang, H., Feiszli, M., Torresani, L.: Video classification with channel-separated convolutional networks. In: ICCV. pp. 5551–5560 (2019)

[49] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: CVPR. pp. 6450–6459 (2018)

[50] Trumble, M., Gilbert, A., Malleson, C., Hilton, A., Collomosse, J.P.: Total capture: 3d human pose estimation fusing video and inertial sensors. In: BMVC (2017)

[51] Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., Xiao, B.: Deep high-resolution representation learning for visual recognition. IEEE TPAMI 43(10), 3349–3364 (2021)

[52] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: Videomae V2: scaling video masked autoencoders with dual masking. In: CVPR. pp. 14549–14560. IEEE (2023)

[53] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.V.: Temporal segment networks for action recognition in videos. IEEE TPAMI 41(11), 2740–2755 (2019)

[54] Xu, L., Jin, S., Liu, W., Qian, C., Ouyang, W., Luo, P., Wang, X.: Zoomnas: Searching for whole-body human pose estimation in the wild. IEEE TPAMI (2022)

[55] Yang, C., Xu, Y., Shi, J., Dai, B., Zhou, B.: Temporal pyramid network for action recognition. In: CVPR. pp. 588–597 (2020)

[56] Yi, X., Zhou, Y., Habermann, M., Shimada, S., Golyanik, V., Theobalt, C., Xu, F.: Physical inertial poser (PIP): physics-aware real-time human motion tracking from sparse inertial sensors. In: CVPR. pp. 13157–13168. IEEE (2022)

[57] Yi, X., Zhou, Y., Xu, F.: Transpose: real-time 3d human translation and pose estimation with six inertial sensors. ACM TOG 40(4), 86:1–86:13 (2021)

[58] Zhang, C., Pujades, S., Black, M.J., Pons-Moll, G.: Detailed, accurate, human shape estimation from clothed 3d scan sequences. In: CVPR. pp. 5484–5493 (2017)

[59] Zhang, F., Zhu, X., Dai, H., Ye, M., Zhu, C.: Distribution-aware coordinate representation for human pose estimation. In: CVPR. pp. 7091–7100 (2020)

[60] Zhang, H., Tian, Y., Zhang, Y., Li, M., An, L., Sun, Z., Liu, Y.: Pymaf-x: Towards well-aligned full-body model regression from monocular images. IEEE Trans. Pattern Anal. Mach. Intell. 45(10), 12287–12303 (2023)

[61] Zhang, H., Tian, Y., Zhou, X., Ouyang, W., Liu, Y., Wang, L., Sun, Z.: Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In: ICCV. pp. 11426–11436. IEEE (2021)

[62] Zhang, Z.: A flexible new technique for camera calibration. IEEE TPAMI 22(11), 1330–1334 (2000)

[63] Zhu, W., Ma, X., Liu, Z., Liu, L., Wu, W., Wang, Y.: Motionbert: A unified perspective on learning human motion representations. In: ICCV. pp. 15085–15099 (2023)

[64] Zhu, Y., Samet, N., Picard, D.: H3wb: Human3.6m 3d wholebody dataset and benchmark. In: ICCV. pp. 20166–20177 (2023)

[1] [1]

http://www.neuronmocap.com (2024) 1, 2, 4, 5, 6

Perception neuron. http://www.neuronmocap.com (2024) 1, 2, 4, 5, 6

work page 2024

[2] [2]

http://www.vicon.com (2024) 1, 2

Vicon blade. http://www.vicon.com (2024) 1, 2

work page 2024

[3] [3]

https://support.xbox.com (2024) 6

Xbox. https://support.xbox.com (2024) 6

work page 2024

[4] [4]

https://www.ximea.com/en/products/usb- 31- gen- 1- with- sony-cmos-xic/mc023cg-sy (2024) 6

Ximea. https://www.ximea.com/en/products/usb- 31- gen- 1- with- sony-cmos-xic/mc023cg-sy (2024) 6

work page 2024

[5] [5]

Springer (1997) 4

Aha, D.: Lazy Learning. Springer (1997) 4

work page 1997

[6] [6]

In: ICCV

Alldieck, T., Xu, H., Sminchisescu, C.: imghum: Implicit generative models of 3d human shape and articulated pose. In: ICCV . pp. 5441–5450 (2021) 4

work page 2021

[7] [7]

ACM TOG 24(3), 408–416 (2005) 4

Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: SCAPE: shape completion and animation of people. ACM TOG 24(3), 408–416 (2005) 4

work page 2005

[8] [8]

In: ICIP

Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: ICIP. pp. 3464–3468 (2016) 8

work page 2016

[9] [9]

In: CVPR

Black, M.J., Patel, P., Tesch, J., Yang, J.: Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In: CVPR. pp. 8726–8737 (2023) 3

work page 2023

[10] [10]

In: ECCV

Bogo, F., Kanazawa, A., Lassner, C., Gehler, P.V ., Romero, J., Black, M.J.: Keep it SMPL: automatic estimation of 3d human pose and shape from a single image. In: ECCV . pp. 561– 578 (2016) 2, 4, 9

work page 2016

[11] [11]

IEEE TPAMI43(1), 172–186 (2021) 8

Cao, Z., Hidalgo, G., Simon, T., Wei, S., Sheikh, Y .: Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE TPAMI43(1), 172–186 (2021) 8

work page 2021

[12] [12]

In: CVPR

Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: CVPR. pp. 4724–4733 (2017) 13, 14

work page 2017

[13] [13]

IEEE Access 8, 176241–176262 (2020) 3

Chatzitofis, A., Saroglou, L., Boutis, P., Drakoulis, P., Zioulis, N., Subramanyam, S., Kevel- ham, B., Charbonnier, C., Cesar, P., Zarpalas, D., et al.: Human4d: A human-centric mul- timodal dataset for motions and immersive media. IEEE Access 8, 176241–176262 (2020) 3

work page 2020

[14] [14]

The Visual Computer 39(5), 1893–1906 (2023) 2, 4

Chen, D., Song, Y ., Liang, F., Ma, T., Zhu, X., Jia, T.: 3d human body reconstruction based on smpl model. The Visual Computer 39(5), 1893–1906 (2023) 2, 4

work page 1906

[15] [15]

In: ICCV

Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: ICCV . pp. 6201–6210 (2019) 3, 14

work page 2019

[16] [16]

IJCV 127(4), 381–397 (2019) 5

Gilbert, A., Trumble, M., Malleson, C., Hilton, A., Collomosse, J.P.: Fusing visual and in- ertial sensors with semantics for 3d human pose estimation. IJCV 127(4), 381–397 (2019) 5

work page 2019

[17] [17]

In: ICCV

Guan, P., Weiss, A., Balan, A.O., Black, M.J.: Estimating human shape and pose from a single image. In: ICCV . pp. 1381–1388 (2009) 4

work page 2009

[18] [18]

IEEE TIP 29, 8476–8489 (2020) 5

Henschel, R., von Marcard, T., Rosenhahn, B.: Accurate long-term multiple people tracking using video and body-worn imus. IEEE TIP 29, 8476–8489 (2020) 5

work page 2020

[19] [19]

In: CVPR

Huang, C.H.P., Yi, H., Höschle, M., Safroshkin, M., Alexiadis, T., Polikovsky, S., Scharstein, D., Black, M.J.: Capturing and inferring dense full-body human-scene contact. In: CVPR. pp. 13274–13285 (2022) 3

work page 2022

[20] [20]

ACM TOG 37(6), 185 (2018) 3, 4, 7

Huang, Y ., Kaufmann, M., Aksan, E., Black, M.J., Hilliges, O., Pons-Moll, G.: Deep inertial poser: learning to reconstruct human pose from sparse inertial measurements in real time. ACM TOG 37(6), 185 (2018) 3, 4, 7

work page 2018

[21] [21]

IEEE TPAMI 36(7), 1325–1339 (2014) 1, 2, 3, 5, 8

Ionescu, C., Papava, D., Olaru, V ., Sminchisescu, C.: Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE TPAMI 36(7), 1325–1339 (2014) 1, 2, 3, 5, 8

work page 2014

[22] [22]

In: SIGGRAPH Asia

Jiang, Y ., Ye, Y ., Gopinath, D., Won, J., Winkler, A.W., Liu, C.K.: Transformer inertial poser: Real-time human motion reconstruction from sparse imus with simultaneous terrain genera- tion. In: SIGGRAPH Asia. pp. 3:1–3:9. ACM (2022) 4 16 Xiaodong Chen et al

work page 2022

[23] [23]

In: ECCV

Jin, S., Xu, L., Xu, J., Wang, C., Liu, W., Qian, C., Ouyang, W., Luo, P.: Whole-body human pose estimation in the wild. In: ECCV . pp. 196–214 (2020) 7

work page 2020

[24] [24]

In: CVPR

Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR. pp. 7122–7131 (2018) 2, 5

work page 2018

[25] [25]

In: ICCV

Kocabas, M., Huang, C.P., Hilliges, O., Black, M.J.: PARE: part attention regressor for 3d human body estimation. In: ICCV. pp. 11107–11117. IEEE (2021) 4

work page 2021

[26] [26]

ICCV (2023) 13, 14

Li, K., Wang, Y ., He, Y ., Li, Y ., Wang, Y ., Wang, L., Qiao, Y .: Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer. ICCV (2023) 13, 14

work page 2023

[27] [27]

In: AAAI

Liang, H., He, Y ., Zhao, C., Li, M., Wang, J., Yu, J., Xu, L.: Hybridcap: Inertia-aid monocular capture of challenging human motions. In: AAAI. pp. 1539–1548. AAAI Press (2023) 5

work page 2023

[28] [28]

In: ICCV

Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: ICCV . pp. 7082–7092 (2019) 14

work page 2019

[29] [29]

IEEE TPAMI 42(10), 2684– 2701 (2020) 13

Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L., Kot, A.C.: NTU RGB+D 120: A large-scale benchmark for 3d human activity understanding. IEEE TPAMI 42(10), 2684– 2701 (2020) 13

work page 2020

[30] Liu, W., Bao, Q., Sun, Y., Mei, T.: Recent advances of monocular 2d and 3d human pose estimation: a deep learning perspective. ACM Comput. Surv. 55(4), 1–41 (2022)

[31] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM TOG 34(6), 248:1–248:16 (2015)

[32] Malleson, C., Collomosse, J.P., Hilton, A.: Real-time multi-person motion capture from multi-view video and imus. IJCV 128(6), 1594–1611 (2020)

[33] von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3d human pose in the wild using imus and a moving camera. In: ECCV. pp. 614–631 (2018)

[34] von Marcard, T., Pons-Moll, G., Rosenhahn, B.: Human pose estimation from video and imus. IEEE TPAMI 38(8), 1533–1547 (2016)

[35] von Marcard, T., Rosenhahn, B., Black, M.J., Pons-Moll, G.: Sparse inertial poser: Automatic 3d human pose estimation from sparse imus. CGF 36(2), 349–360 (2017)

[36] Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3d human pose estimation. In: ICCV. pp. 2659–2668 (2017)

[37] Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3d human pose estimation in the wild using improved CNN supervision. In: IEEE 3DV. pp. 506–516 (2017)

[38] Mehta, D., Sotnychenko, O., Mueller, F., Xu, W., Sridhar, S., Pons-Moll, G., Theobalt, C.: Single-shot multi-person 3d pose estimation from monocular RGB. In: IEEE 3DV. pp. 120–130 (2018)

[39] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: CVPR. pp. 10975–10985 (2019)

[40] Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3d human pose estimation in video with temporal convolutions and semi-supervised training. In: CVPR. pp. 7753–7762 (2019)

[41] Riaz, Q., Tao, G., Krüger, B., Weber, A.: Motion reconstruction using very few accelerometers and ground contacts. Graph. Model. 79, 23–38 (2015)

[42] Schepers, M., Giuberti, M., Bellusci, G., et al.: Xsens mvn: Consistent tracking of human motion using inertial sensing. Xsens Technol 1(8) (2018)

[43] Shahroudy, A., Liu, J., Ng, T., Wang, G.: NTU RGB+D: A large scale dataset for 3d human activity analysis. In: CVPR. pp. 1010–1019 (2016)

[44] Shin, S., Li, Z., Halilaj, E.: Markerless motion tracking with noisy video and IMU data. IEEE Trans. Biomed. Eng. 70(11), 3082–3092 (2023)

[45] Sigal, L., Balan, A.O., Black, M.J.: Combined discriminative and generative articulated pose and non-rigid shape estimation. In: NIPS. pp. 1337–1344 (2007)

[46] Sun, Y., Bao, Q., Liu, W., Fu, Y., Black, M.J., Mei, T.: Monocular, one-stage, regression of multiple 3d people. In: ICCV. pp. 11159–11168 (2021)

[47] Tautges, J., Zinke, A., Krüger, B., Baumann, J., Weber, A., Helten, T., Müller, M., Seidel, H., Eberhardt, B.: Motion reconstruction using sparse accelerometer data. ACM TOG 30(3), 18:1–18:12 (2011)

[48] Tran, D., Wang, H., Feiszli, M., Torresani, L.: Video classification with channel-separated convolutional networks. In: ICCV. pp. 5551–5560 (2019)

[49] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: CVPR. pp. 6450–6459 (2018)

[50] Trumble, M., Gilbert, A., Malleson, C., Hilton, A., Collomosse, J.P.: Total capture: 3d human pose estimation fusing video and inertial sensors. In: BMVC (2017)

[51] Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., Xiao, B.: Deep high-resolution representation learning for visual recognition. IEEE TPAMI 43(10), 3349–3364 (2021)

[52] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: Videomae V2: scaling video masked autoencoders with dual masking. In: CVPR. pp. 14549–14560. IEEE (2023)

[53] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.V.: Temporal segment networks for action recognition in videos. IEEE TPAMI 41(11), 2740–2755 (2019)

[54] Xu, L., Jin, S., Liu, W., Qian, C., Ouyang, W., Luo, P., Wang, X.: Zoomnas: Searching for whole-body human pose estimation in the wild. IEEE TPAMI (2022)

[55] Yang, C., Xu, Y., Shi, J., Dai, B., Zhou, B.: Temporal pyramid network for action recognition. In: CVPR. pp. 588–597 (2020)

[56] Yi, X., Zhou, Y., Habermann, M., Shimada, S., Golyanik, V., Theobalt, C., Xu, F.: Physical inertial poser (PIP): physics-aware real-time human motion tracking from sparse inertial sensors. In: CVPR. pp. 13157–13168. IEEE (2022)

[57] Yi, X., Zhou, Y., Xu, F.: Transpose: real-time 3d human translation and pose estimation with six inertial sensors. ACM TOG 40(4), 86:1–86:13 (2021)

[58] Zhang, C., Pujades, S., Black, M.J., Pons-Moll, G.: Detailed, accurate, human shape estimation from clothed 3d scan sequences. In: CVPR. pp. 5484–5493 (2017)

[59] Zhang, F., Zhu, X., Dai, H., Ye, M., Zhu, C.: Distribution-aware coordinate representation for human pose estimation. In: CVPR. pp. 7091–7100 (2020)

[60] Zhang, H., Tian, Y., Zhang, Y., Li, M., An, L., Sun, Z., Liu, Y.: Pymaf-x: Towards well-aligned full-body model regression from monocular images. IEEE Trans. Pattern Anal. Mach. Intell. 45(10), 12287–12303 (2023)

[61] Zhang, H., Tian, Y., Zhou, X., Ouyang, W., Liu, Y., Wang, L., Sun, Z.: Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In: ICCV. pp. 11426–11436. IEEE (2021)

[62] Zhang, Z.: A flexible new technique for camera calibration. IEEE TPAMI 22(11), 1330–1334 (2000)

[63] Zhu, W., Ma, X., Liu, Z., Liu, L., Wu, W., Wang, Y.: Motionbert: A unified perspective on learning human motion representations. In: ICCV. pp. 15085–15099 (2023)

[64] Zhu, Y., Samet, N., Picard, D.: H3wb: Human3.6m 3d wholebody dataset and benchmark. In: ICCV. pp. 20166–20177 (2023)