arxiv: 2407.16341 · v4 · submitted 2024-07-23 · 💻 cs.CV

Motion Capture from Inertial and Vision Sensors

Pith reviewed 2026-05-23 22:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords motion capture · IMU · monocular video · human pose estimation · multi-modal fusion · SMPL parameters · consumer hardware

The pith

A monocular camera plus a few IMUs can capture human motion accurately enough for daily use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a large dataset of synchronized IMU readings and video frames that records 146 fine-grained actions across 400 minutes. It then introduces a network that fuses the two sensor streams to recover joint positions, rotations, and body shape parameters. The work matters because industrial motion capture still relies on dozens of cameras or many sensors, while consumer devices already carry one camera and can add cheap IMUs. If the fusion works, everyday phones or wearables could replace studio rigs for animation, fitness, or rehabilitation.

Core claim

The authors claim that inertial signals and monocular video supply complementary information that together suffice for accurate multi-modal human motion capture, and they demonstrate this sufficiency by releasing the MINIONS dataset and training SparseNet on it.

What carries the argument

SparseNet, a fusion network that learns to combine sparse IMU measurements with RGB video features to output SMPL parameters and joint angles.
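The abstract does not describe SparseNet's architecture, so the following is only a minimal sketch of the general idea of fusing the two streams: concatenate per-frame IMU readings with a video feature vector and map the result to SMPL pose and shape with one untrained linear layer. Every name and size here is an assumption for illustration (`NUM_IMUS`, `IMU_DIM`, `VIDEO_DIM`, `make_linear`, `fuse_frame`); the sensor count of six is borrowed from the title of reference [57], not from this paper. The only fixed facts are the SMPL sizes: 24 joints with 3 axis-angle values each, and the commonly used 10 shape coefficients.

```python
import random

# All sizes below are illustrative assumptions, not values from the paper.
NUM_IMUS = 6     # assumed sensor count; the paper only says "very few"
IMU_DIM = 12     # assumed per-sensor input: 3x3 rotation (9) + acceleration (3)
VIDEO_DIM = 32   # assumed length of a per-frame video feature vector
POSE_DIM = 72    # SMPL pose: 24 joints x 3 axis-angle values
SHAPE_DIM = 10   # SMPL shape coefficients as commonly used

def make_linear(in_dim, out_dim, seed=0):
    """Create a randomly initialised (untrained) weight matrix and zero bias."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def fuse_frame(imu_readings, video_feature, layer):
    """Concatenate both modalities and map them to SMPL pose and shape."""
    weights, bias = layer
    # Flatten all IMU readings, then append the video feature vector.
    x = [v for reading in imu_readings for v in reading] + list(video_feature)
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, bias)]
    return out[:POSE_DIM], out[POSE_DIM:]

layer = make_linear(NUM_IMUS * IMU_DIM + VIDEO_DIM, POSE_DIM + SHAPE_DIM)
imu = [[0.0] * IMU_DIM for _ in range(NUM_IMUS)]   # placeholder sensor data
video = [0.0] * VIDEO_DIM                          # placeholder image feature
pose, shape = fuse_frame(imu, video, layer)
print(len(pose), len(shape))  # 72 10
```

A trained system would replace the single linear layer with learned encoders and temporal modelling; the sketch only fixes the input and output shapes that any such fusion has to reconcile.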

If this is right

  • Motion capture becomes feasible with hardware already present in consumer phones and watches.
  • The released dataset supplies training data for other multi-modal pose estimators.
  • Sparse fusion reduces the sensor count needed for acceptable accuracy in interactive applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sensor mix could be tested on longer, continuous recordings to check drift accumulation.
  • Results may transfer to single-person tracking on mobile devices if the network is quantized.
  • Interactive actions in the dataset suggest possible extensions to two-person collaboration or sports analysis.

Load-bearing premise

The combination of one camera and very few IMUs is enough to produce accurate motion estimates outside controlled studio conditions.

What would settle it

Record the same actions with the proposed sensor mix and with a full optical marker system; if the average joint-position error stays above 5 cm or the rotation error above 10 degrees across diverse daily actions, the claim fails.
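The test above can be sketched in a few lines, assuming joint positions are given in centimetres and joint rotations as 3x3 rotation matrices. The function names and the toy data are hypothetical, the sketch scores a single frame rather than an average over many actions, and the rotation error is taken as the geodesic angle between two rotations, which is one common choice and not necessarily the metric used in the paper.

```python
import math

POS_LIMIT_CM = 5.0    # position threshold quoted in the text
ROT_LIMIT_DEG = 10.0  # rotation threshold quoted in the text

def mean_joint_position_error(pred, ref):
    """Average Euclidean distance between matching joints (unit of the input)."""
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)

def rotation_error_deg(rot_a, rot_b):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(rot_a^T rot_b) equals the element-wise product summed over entries.
    trace = sum(rot_a[i][j] * rot_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))

def claim_fails(pos_err_cm, rot_err_deg):
    """Apply the stated rule: either error above its limit rejects the claim."""
    return pos_err_cm > POS_LIMIT_CM or rot_err_deg > ROT_LIMIT_DEG

# Toy data (hypothetical): two joints in cm and one joint rotation.
pred_joints = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
ref_joints = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

pos_err = mean_joint_position_error(pred_joints, ref_joints)  # 2.5 cm
rot_err = rotation_error_deg(identity, quarter_turn_z)        # 90 degrees
print(pos_err, rot_err, claim_fails(pos_err, rot_err))
```

In the toy case the position error passes the 5 cm limit but the quarter-turn rotation error does not, so the rule reports a failure. A real evaluation would average both errors over all joints, frames, and actions before applying the thresholds.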

Figures

Figures reproduced from arXiv: 2407.16341 by Qian Bao, Ruoli Dai, Tao Mei, Wu Liu, Xiaodong Chen, Xinchen Liu, Yongdong Zhang.

Figure 1
Figure 1: Overview of our MINIONS dataset. It is collected by multiple types of sensors including eight 2K-resolution RGB cameras, Inertial Measurement Units (IMUs), and an RGB-D scanner. With the multi-modal data, we annotate human motion sequences with (d) 2D/3D joints, (e) the SMPL parameters, (f) the texture of each actor from a scanner, and fine-grained action types with textual descriptions.
Figure 3
Figure 3: Overview of dataset construction. (a) Textured mesh reconstruction with an RGB-D scanner; (b) 3D joints triangulation and tracking from multi-view videos; (c) Human pose from full-body IMUs data; and (d) Motion recovery from inertial and visual results.
3.1 Hardware Setup: We collect raw data in multiple scenes using four to eight synchronized cameras and full-body IMU suits with 17 sensors, as shown in …
Figure 4
Figure 4: Example frame of motion recovery with inertial and visual data.
Figure 5
Figure 5: Fine-grained Actions. MINIONS contains 121 single-player actions and 25 multi-player actions including common person-person and person-object interactive actions in daily life.
… we post-process the 2D joints through DarkNet [59] to reduce jitters and improve accuracy. The detection result contains 25 joints P2d of body, face, and feet in the same format as OpenPose [11]. We discard the uncertain joint…
Figure 6
Figure 6: Qualitative results from single-subject motion capture data collection.
Figure 7
Figure 7: Qualitative results from multi-subjects motion capture data collection.
Figure 9
Figure 9: Visualization. (a): Average angular error (in degree) over sequences. (b): Average translation error (in mm) over sequences.
… space and angular space. Additionally, we use the Jitter to measure the average jerk of body joints. Our experimental results are detailed in …
Figure 8
Figure 8: Visualization comparisons among the vision-based, IMUs-based, and multi-modal human motion capture. The vertexes are colored by the distances to the ground truth positions.
Visualization. To facilitate a more intuitive comparison, we provide visualization results of vision-based, IMUs-based, and multi-modal motion capture in …
Original abstract

Human motion capture is the foundation for many computer vision and graphics tasks. While industrial motion capture systems with complex camera arrays or expensive wearable sensors have been widely adopted in movie and game production, consumer-affordable and easy-to-use solutions for personal applications are still far from mature. To utilize a mixture of a monocular camera and very few inertial measurement units (IMUs) for accurate multi-modal human motion capture in daily life, we contribute MINIONS in this paper, a large-scale Motion capture dataset collected from INertial and visION Sensors. MINIONS has several featured properties: 1) large scale of over five million frames and 400 minutes duration; 2) multi-modality data of IMUs signals and RGB videos labeled with joint positions, joint rotations, SMPL parameters, etc.; 3) a diverse set of 146 fine-grained single and interactive actions with textual descriptions. With the proposed MINIONS dataset, we propose a SparseNet framework to capture human motion from IMUs and videos by discovering their supplementary features and exploring the possibilities of consumer-affordable motion capture using a monocular camera and very few IMUs. The experiment results emphasize the unique advantages of inertial and vision sensors, showcasing the promise of consumer-affordable multi-modal motion capture and providing a valuable resource for further research and development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to contribute the MINIONS dataset—a large-scale collection of over five million frames (400 minutes) of synchronized IMU signals and monocular RGB videos, annotated with joint positions, joint rotations, and SMPL parameters across 146 fine-grained single and interactive actions—and the SparseNet framework that fuses inertial and visual data to enable accurate human motion capture using only a monocular camera and very few IMUs for consumer-affordable daily-life applications.

Significance. If validated, the MINIONS dataset would provide a substantial public resource for multi-modal motion capture research due to its scale, action diversity, and textual descriptions, while SparseNet could demonstrate practical sensor fusion for sparse setups, addressing the gap between industrial systems and accessible consumer solutions in graphics, vision, and AR/VR applications.

major comments (1)
  1. [Abstract] The central claims (that MINIONS enables consumer-affordable capture and that SparseNet discovers supplementary IMU-video features for accurate reconstruction) are presented without any description of the network architecture, sensor placement protocol, data collection procedure, loss functions, evaluation metrics, baselines, or quantitative results, making it impossible to assess whether the data or framework support the stated claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and feedback. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claims (that MINIONS enables consumer-affordable capture and that SparseNet discovers supplementary IMU-video features for accurate reconstruction) are presented without any description of the network architecture, sensor placement protocol, data collection procedure, loss functions, evaluation metrics, baselines, or quantitative results, making it impossible to assess whether the data or framework support the stated claims.

    Authors: We agree that the provided abstract is a high-level summary and does not contain the requested technical details. Abstracts are designed to be concise overviews; the full manuscript contains dedicated sections describing the SparseNet architecture, sensor placement, data collection protocol for the MINIONS dataset, loss functions, evaluation metrics, baselines, and quantitative results that support the claims. The referee summary already references these elements from the paper, indicating the full text was available for review.

    Revision: no

Circularity Check

0 steps flagged

No significant circularity detected

The provided document consists only of an abstract with no equations, derivations, parameter fittings, or technical claims about how results are obtained from inputs. The work introduces a new dataset (MINIONS) and framework (SparseNet) as an empirical contribution for multi-modal motion capture, without presenting any derivation chain that could reduce to self-definition, fitted inputs renamed as predictions, or self-citation load-bearing steps. No load-bearing assumptions are isolated or shown to be circular by the paper's own text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no equations, parameters, or modeling choices are described, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5745 in / 1043 out tokens · 23914 ms · 2026-05-23T22:28:46.981440+00:00 · methodology

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages

  1. [1] Perception neuron. http://www.neuronmocap.com (2024)

  2. [2] Vicon blade. http://www.vicon.com (2024)

  3. [3] Xbox. https://support.xbox.com (2024)

  4. [4] Ximea. https://www.ximea.com/en/products/usb-31-gen-1-with-sony-cmos-xic/mc023cg-sy (2024)

  5. [5] Aha, D.: Lazy Learning. Springer (1997)

  6. [6] Alldieck, T., Xu, H., Sminchisescu, C.: imghum: Implicit generative models of 3d human shape and articulated pose. In: ICCV. pp. 5441–5450 (2021)

  7. [7] Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: SCAPE: shape completion and animation of people. ACM TOG 24(3), 408–416 (2005)

  8. [8] Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: ICIP. pp. 3464–3468 (2016)

  9. [9] Black, M.J., Patel, P., Tesch, J., Yang, J.: Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In: CVPR. pp. 8726–8737 (2023)

  10. [10] Bogo, F., Kanazawa, A., Lassner, C., Gehler, P.V., Romero, J., Black, M.J.: Keep it SMPL: automatic estimation of 3d human pose and shape from a single image. In: ECCV. pp. 561–578 (2016)

  11. [11] Cao, Z., Hidalgo, G., Simon, T., Wei, S., Sheikh, Y.: Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE TPAMI 43(1), 172–186 (2021)

  12. [12] Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: CVPR. pp. 4724–4733 (2017)

  13. [13] Chatzitofis, A., Saroglou, L., Boutis, P., Drakoulis, P., Zioulis, N., Subramanyam, S., Kevelham, B., Charbonnier, C., Cesar, P., Zarpalas, D., et al.: Human4d: A human-centric multimodal dataset for motions and immersive media. IEEE Access 8, 176241–176262 (2020)

  14. [14] Chen, D., Song, Y., Liang, F., Ma, T., Zhu, X., Jia, T.: 3d human body reconstruction based on smpl model. The Visual Computer 39(5), 1893–1906 (2023)

  15. [15] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: ICCV. pp. 6201–6210 (2019)

  16. [16] Gilbert, A., Trumble, M., Malleson, C., Hilton, A., Collomosse, J.P.: Fusing visual and inertial sensors with semantics for 3d human pose estimation. IJCV 127(4), 381–397 (2019)

  17. [17] Guan, P., Weiss, A., Balan, A.O., Black, M.J.: Estimating human shape and pose from a single image. In: ICCV. pp. 1381–1388 (2009)

  18. [18] Henschel, R., von Marcard, T., Rosenhahn, B.: Accurate long-term multiple people tracking using video and body-worn imus. IEEE TIP 29, 8476–8489 (2020)

  19. [19] Huang, C.H.P., Yi, H., Höschle, M., Safroshkin, M., Alexiadis, T., Polikovsky, S., Scharstein, D., Black, M.J.: Capturing and inferring dense full-body human-scene contact. In: CVPR. pp. 13274–13285 (2022)

  20. [20] Huang, Y., Kaufmann, M., Aksan, E., Black, M.J., Hilliges, O., Pons-Moll, G.: Deep inertial poser: learning to reconstruct human pose from sparse inertial measurements in real time. ACM TOG 37(6), 185 (2018)

  21. [21] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE TPAMI 36(7), 1325–1339 (2014)

  22. [22] Jiang, Y., Ye, Y., Gopinath, D., Won, J., Winkler, A.W., Liu, C.K.: Transformer inertial poser: Real-time human motion reconstruction from sparse imus with simultaneous terrain generation. In: SIGGRAPH Asia. pp. 3:1–3:9. ACM (2022)

  23. [23] Jin, S., Xu, L., Xu, J., Wang, C., Liu, W., Qian, C., Ouyang, W., Luo, P.: Whole-body human pose estimation in the wild. In: ECCV. pp. 196–214 (2020)

  24. [24] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR. pp. 7122–7131 (2018)

  25. [25] Kocabas, M., Huang, C.P., Hilliges, O., Black, M.J.: PARE: part attention regressor for 3d human body estimation. In: ICCV. pp. 11107–11117. IEEE (2021)

  26. [26] Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Wang, L., Qiao, Y.: Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer. ICCV (2023)

  27. [27] Liang, H., He, Y., Zhao, C., Li, M., Wang, J., Yu, J., Xu, L.: Hybridcap: Inertia-aid monocular capture of challenging human motions. In: AAAI. pp. 1539–1548. AAAI Press (2023)

  28. [28] Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: ICCV. pp. 7082–7092 (2019)

  29. [29] Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L., Kot, A.C.: NTU RGB+D 120: A large-scale benchmark for 3d human activity understanding. IEEE TPAMI 42(10), 2684–2701 (2020)

  30. [30] Liu, W., Bao, Q., Sun, Y., Mei, T.: Recent advances of monocular 2d and 3d human pose estimation: a deep learning perspective. ACM Comput. Surv. 55(4), 1–41 (2022)

  31. [31] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM TOG 34(6), 248:1–248:16 (2015)

  32. [32] Malleson, C., Collomosse, J.P., Hilton, A.: Real-time multi-person motion capture from multi-view video and imus. IJCV 128(6), 1594–1611 (2020)

  33. [33] von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3d human pose in the wild using imus and a moving camera. In: ECCV. pp. 614–631 (2018)

  34. [34] von Marcard, T., Pons-Moll, G., Rosenhahn, B.: Human pose estimation from video and imus. IEEE TPAMI 38(8), 1533–1547 (2016)

  35. [35] von Marcard, T., Rosenhahn, B., Black, M.J., Pons-Moll, G.: Sparse inertial poser: Automatic 3d human pose estimation from sparse imus. CGF 36(2), 349–360 (2017)

  36. [36] Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3d human pose estimation. In: ICCV. pp. 2659–2668 (2017)

  37. [37] Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3d human pose estimation in the wild using improved CNN supervision. In: IEEE 3DV. pp. 506–516 (2017)

  38. [38] Mehta, D., Sotnychenko, O., Mueller, F., Xu, W., Sridhar, S., Pons-Moll, G., Theobalt, C.: Single-shot multi-person 3d pose estimation from monocular RGB. In: IEEE 3DV. pp. 120–130 (2018)

  39. [39] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: CVPR. pp. 10975–10985 (2019)

  40. [40] Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3d human pose estimation in video with temporal convolutions and semi-supervised training. In: CVPR. pp. 7753–7762 (2019)

  41. [41] Riaz, Q., Tao, G., Krüger, B., Weber, A.: Motion reconstruction using very few accelerometers and ground contacts. Graph. Model. 79, 23–38 (2015)

  42. [42] Schepers, M., Giuberti, M., Bellusci, G., et al.: Xsens mvn: Consistent tracking of human motion using inertial sensing. Xsens Technol 1(8) (2018)

  43. [43] Shahroudy, A., Liu, J., Ng, T., Wang, G.: NTU RGB+D: A large scale dataset for 3d human activity analysis. In: CVPR. pp. 1010–1019 (2016)

  44. [44] Shin, S., Li, Z., Halilaj, E.: Markerless motion tracking with noisy video and IMU data. IEEE Trans. Biomed. Eng. 70(11), 3082–3092 (2023)

  45. [45] Sigal, L., Balan, A.O., Black, M.J.: Combined discriminative and generative articulated pose and non-rigid shape estimation. In: NIPS. pp. 1337–1344 (2007)

  46. [46] Sun, Y., Bao, Q., Liu, W., Fu, Y., Black, M.J., Mei, T.: Monocular, one-stage, regression of multiple 3d people. In: ICCV. pp. 11159–11168 (2021)

  47. [47] Tautges, J., Zinke, A., Krüger, B., Baumann, J., Weber, A., Helten, T., Müller, M., Seidel, H., Eberhardt, B.: Motion reconstruction using sparse accelerometer data. ACM TOG 30(3), 18:1–18:12 (2011)

  48. [48] Tran, D., Wang, H., Feiszli, M., Torresani, L.: Video classification with channel-separated convolutional networks. In: ICCV. pp. 5551–5560 (2019)

  49. [49] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: CVPR. pp. 6450–6459 (2018)

  50. [50] Trumble, M., Gilbert, A., Malleson, C., Hilton, A., Collomosse, J.P.: Total capture: 3d human pose estimation fusing video and inertial sensors. In: BMVC (2017)

  51. [51] Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., Xiao, B.: Deep high-resolution representation learning for visual recognition. IEEE TPAMI 43(10), 3349–3364 (2021)

  52. [52] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: Videomae V2: scaling video masked autoencoders with dual masking. In: CVPR. pp. 14549–14560. IEEE (2023)

  53. [53] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.V.: Temporal segment networks for action recognition in videos. IEEE TPAMI 41(11), 2740–2755 (2019)

  54. [54] Xu, L., Jin, S., Liu, W., Qian, C., Ouyang, W., Luo, P., Wang, X.: Zoomnas: Searching for whole-body human pose estimation in the wild. IEEE TPAMI (2022)

  55. [55] Yang, C., Xu, Y., Shi, J., Dai, B., Zhou, B.: Temporal pyramid network for action recognition. In: CVPR. pp. 588–597 (2020)

  56. [56] Yi, X., Zhou, Y., Habermann, M., Shimada, S., Golyanik, V., Theobalt, C., Xu, F.: Physical inertial poser (PIP): physics-aware real-time human motion tracking from sparse inertial sensors. In: CVPR. pp. 13157–13168. IEEE (2022)

  57. [57] Yi, X., Zhou, Y., Xu, F.: Transpose: real-time 3d human translation and pose estimation with six inertial sensors. ACM TOG 40(4), 86:1–86:13 (2021)

  58. [58] Zhang, C., Pujades, S., Black, M.J., Pons-Moll, G.: Detailed, accurate, human shape estimation from clothed 3d scan sequences. In: CVPR. pp. 5484–5493 (2017)

  59. [59] Zhang, F., Zhu, X., Dai, H., Ye, M., Zhu, C.: Distribution-aware coordinate representation for human pose estimation. In: CVPR. pp. 7091–7100 (2020)

  60. [60] Zhang, H., Tian, Y., Zhang, Y., Li, M., An, L., Sun, Z., Liu, Y.: Pymaf-x: Towards well-aligned full-body model regression from monocular images. IEEE Trans. Pattern Anal. Mach. Intell. 45(10), 12287–12303 (2023)

  61. [61] Zhang, H., Tian, Y., Zhou, X., Ouyang, W., Liu, Y., Wang, L., Sun, Z.: Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In: ICCV. pp. 11426–11436. IEEE (2021)

  62. [62] Zhang, Z.: A flexible new technique for camera calibration. IEEE TPAMI 22(11), 1330–1334 (2000)

  63. [63] Zhu, W., Ma, X., Liu, Z., Liu, L., Wu, W., Wang, Y.: Motionbert: A unified perspective on learning human motion representations. In: ICCV. pp. 15085–15099 (2023)

  64. [64] Zhu, Y., Samet, N., Picard, D.: H3wb: Human3.6m 3d wholebody dataset and benchmark. In: ICCV. pp. 20166–20177 (2023)