Recognition: unknown
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
Pith reviewed 2026-05-09 21:55 UTC · model grok-4.3
The pith
Wearable inertial sensors enable 4D human motion and scene layout reconstruction by repurposing large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IMU-to-4D is a framework that repurposes large language models to predict detailed 4D human motion together with coarse scene structure directly from signals of a few inertial sensors, yielding more coherent and temporally stable outputs than state-of-the-art cascaded pipelines across diverse human-scene datasets.
What carries the argument
The IMU-to-4D framework, which repurposes large language models to perform non-visual spatiotemporal reasoning that jointly infers human motion dynamics and rough 3D scene layouts from sparse IMU data.
If this is right
- Wearable devices can deliver coherent 4D perception without relying on cameras or visual processing.
- Motion data alone supports inference of coarse 3D scene layouts in addition to human poses.
- Outputs maintain higher temporal stability than cascaded visual pipelines across varied datasets.
- Applications become viable in privacy-sensitive or energy-constrained settings where cameras are unsuitable.
Where Pith is reading between the lines
- The approach may extend naturally to real-time operation on consumer-grade hardware if model inference costs are managed.
- Combining IMU signals with minimal additional non-visual cues could resolve ambiguities in scene prediction.
- Similar repurposing of language models might apply to inferring object interactions or human intent from motion patterns alone.
Load-bearing premise
Large language models can map sparse non-visual inertial signals to accurate detailed human motion and scene structure without visual supervision or explicit geometric priors.
What would settle it
A test set of IMU recordings from identical human motions performed inside structurally different scenes, with evaluation of whether the predicted scene layouts distinguish the environments correctly.
Figures
read the original abstract
Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than SoTA cascaded pipelines, suggesting wearable motion sensors alone can support rich 4D understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces IMU-to-4D, a framework that repurposes large language models to reconstruct detailed 4D human motion together with coarse 3D scene layouts directly from sparse wearable IMU signals (earbuds, watches, smartphones), claiming that this yields more coherent and temporally stable results than state-of-the-art cascaded pipelines across diverse human-scene datasets.
Significance. If the empirical superiority holds with proper validation, the work would be significant for enabling privacy-preserving, low-energy 4D human-scene understanding without cameras. However, the absence of any quantitative metrics, ablation studies, or implementation details prevents assessment of whether the result actually demonstrates non-visual scene inference or merely reflects memorized correlations.
major comments (2)
- [Abstract] Abstract: the central empirical claim that IMU-to-4D 'yields more coherent and temporally stable results than SoTA cascaded pipelines' supplies no quantitative metrics (e.g., no coherence scores, temporal stability measures, or numerical comparison values), no tables, and no ablation studies, so the superiority cannot be evaluated.
- [Abstract] Abstract / Method description: IMU signals provide only local acceleration and angular velocity with no direct scene geometry; the manuscript gives no information on (a) IMU tokenization for the LLM, (b) whether scene supervision derives from visual ground truth or motion alone, or (c) any test-time ablation removing visual cues, which is required to attribute results to non-visual 4D understanding.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We agree that the abstract and method sections would benefit from more explicit quantitative support and implementation details to allow readers to fully evaluate the claims. We will revise the manuscript to incorporate these elements.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central empirical claim that IMU-to-4D 'yields more coherent and temporally stable results than SoTA cascaded pipelines' supplies no quantitative metrics (e.g., no coherence scores, temporal stability measures, or numerical comparison values), no tables, and no ablation studies, so the superiority cannot be evaluated.
Authors: We acknowledge that the abstract as currently written summarizes the empirical findings without specific numbers. The experiments section of the manuscript contains quantitative comparisons, tables, and ablation studies on coherence and temporal stability metrics across datasets. To address the concern directly, we will revise the abstract to include representative numerical results and explicit references to the supporting tables and ablations. revision: yes
-
Referee: [Abstract] Abstract / Method description: IMU signals provide only local acceleration and angular velocity with no direct scene geometry; the manuscript gives no information on (a) IMU tokenization for the LLM, (b) whether scene supervision derives from visual ground truth or motion alone, or (c) any test-time ablation removing visual cues, which is required to attribute results to non-visual 4D understanding.
Authors: We agree these details are necessary to substantiate the non-visual claim. We will expand the method section in the revision to fully describe (a) the IMU tokenization procedure used to prepare signals for the LLM, (b) the source of scene supervision during training, and (c) an ablation that isolates performance when visual cues are removed. This will allow clear attribution to IMU-based inference. revision: yes
Circularity Check
No circularity: empirical framework with no derivation chain
full rationale
The paper introduces IMU-to-4D as an empirical framework that repurposes LLMs to map wearable IMU signals to 4D human motion and coarse scene structure. All central claims rest on experimental comparisons across human-scene datasets showing improved coherence over cascaded pipelines. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described approach. The result is presented as data-driven evidence rather than a mathematical reduction to its own inputs, making the derivation chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the 16 H
Araújo, J.P., Li, J., Vetrivel, K., Agarwal, R., Wu, J., Gopinath, D., Clegg, A.W., Liu, K.: Circle: Capture in rich contextual environments. In: Proceedings of the 16 H. Hsu, T. Cheng et al. IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21211– 21221 (2023)
2023
-
[2]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Bhatnagar, B.L., Xie, X., Petrov, I.A., Sminchisescu, C., Theobalt, C., Pons- Moll, G.: Behave: Dataset and method for tracking human object interactions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15935–15946 (2022)
2022
-
[4]
In: 2024 International Conference on 3D Vision (3DV)
Braun, J., Christen, S., Kocabas, M., Aksan, E., Hilliges, O.: Physically plausible full-body hand-object interaction synthesis. In: 2024 International Conference on 3D Vision (3DV). pp. 464–473. IEEE (2024)
2024
-
[5]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Castillo, A., Escobar, M., Jeanneret, G., Pumarola, A., Arbeláez, P., Thabet, A., Sanakoyeu, A.: Bodiffusion: Diffusing sparse observations for full-body human motion synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4221–4231 (2023)
2023
-
[6]
Chen, L.H., Lu, S., Zeng, A., Zhang, H., Wang, B., Zhang, R., Zhang, L.: Mo- tionllm: Understanding human behaviors from human motions and videos. arXiv preprint arXiv:2405.20340 (2024)
-
[7]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025)
work page internal anchor Pith review arXiv 2025
-
[8]
Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)
work page internal anchor Pith review arXiv 2024
-
[9]
In: Proceedings of the IEEE/CVF international conference on computer vision
Chung, J., Wuu, C.h., Yang, H.r., Tai, Y.W., Tang, C.K.: Haa500: Human-centric atomic action dataset with curated videos. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 13465–13474 (2021)
2021
-
[10]
IEEE transactions on pattern analysis and machine intelligence (2024)
Delmas, G., Weinzaepfel, P., Lucas, T., Moreno-Noguer, F., Rogez, G.: Posescript: Linking 3d human poses and natural language. IEEE transactions on pattern analysis and machine intelligence (2024)
2024
-
[11]
Emerging Properties in Unified Multimodal Pretraining
Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)
work page internal anchor Pith review arXiv 2025
-
[12]
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectionaltransformersforlanguageunderstanding.In:Proceedingsofthe2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). pp. 4171–4186 (2019)
2019
-
[13]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Dittadi, A., Dziadzio, S., Cosker, D., Lundell, B., Cashman, T.J., Shotton, J.: Full-body motion from a single head-mounted device: Generating smpl poses from partial observations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11687–11697 (2021)
2021
-
[14]
In: Proceedings of the IEEE/CVFconferenceoncomputervisionandpatternrecognition.pp.1323–1333 (2024)
Dwivedi, S.K., Sun, Y., Patel, P., Feng, Y., Black, M.J.: Tokenhmr: Advancing human mesh recovery with a tokenized pose representation. In: Proceedings of the IEEE/CVFconferenceoncomputervisionandpatternrecognition.pp.1323–1333 (2024)
2024
-
[15]
In: Seeing Without Eyes: 4D Human–Scene Understanding from Wearable IMUs 17 Proceedings of the IEEE/CVF International Conference on Computer Vision
Fan, K., Lu, S., Dai, M., Yu, R., Xiao, L., Dou, Z., Dong, J., Ma, L., Wang, J.: Go to zero: Towards zero-shot motion generation with million-scale data. In: Seeing Without Eyes: 4D Human–Scene Understanding from Wearable IMUs 17 Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13336–13348 (2025)
2025
-
[16]
In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition
Feng, Y., Lin, J., Dwivedi, S.K., Sun, Y., Patel, P., Black, M.J.: Chatpose: Chat- ting about 3d human pose. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 2093–2103 (2024)
2093
-
[17]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Fieraru, M., Zanfir, M., Oneata, E., Popa, A.I., Olaru, V., Sminchisescu, C.: Learning complex 3d human self-contact. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 1343–1351 (2021)
2021
-
[18]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Fieraru, M., Zanfir, M., Pirlea, S.C., Olaru, V., Sminchisescu, C.: Aifit: Automatic 3d human-interpretable feedback models for fitness training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9919– 9928 (2021)
2021
-
[19]
Seed-x: Multimodal models with unified multi-granularity comprehension and generation
Ge, Y., Zhao, S., Zhu, J., Ge, Y., Yi, K., Song, L., Li, C., Ding, X., Shan, Y.: Seed- x: Multimodal models with unified multi-granularity comprehension and genera- tion. arXiv preprint arXiv:2404.14396 (2024)
-
[20]
In: Computer Graphics Forum
Ghosh, A., Dabral, R., Golyanik, V., Theobalt, C., Slusallek, P.: Imos: Intent- driven full-body motion synthesis for human-object interactions. In: Computer Graphics Forum. vol. 42, pp. 1–12. Wiley Online Library (2023)
2023
-
[21]
In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Nov 2024)
Ghosh, S., Kumar, S., Seth, A., Evuru, C.K.R., Tyagi, U., Sakshi, S., Nieto, O., Duraiswami, R., Manocha, D.: GAMA: A large audio-language model with advanced audio understanding and complex reasoning abilities. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Nov 2024)
2024
-
[22]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., Afouras, T., Ashutosh, K., Baiyya, V., Bansal, S., Boote, B., et al.: Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19383–19400 (2024)
2024
-
[23]
Snapmogen: Human motion generation from expressive texts.arXiv preprint arXiv:2507.09122, 2025
Guo, C., Hwang, I., Wang, J., Zhou, B.: Snapmogen: Human motion generation from expressive texts. arXiv preprint arXiv:2507.09122 (2025)
-
[24]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d human motions from text. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5152–5161 (2022)
2022
-
[25]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Guzov, V., Mir, A., Sattler, T., Pons-Moll, G.: Human poseitioning system (hps): 3dhumanposeestimationandself-localizationinlarge scenes frombody-mounted sensors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4318–4329 (2021)
2021
-
[26]
In: Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision
Hassan, M., Ceylan, D., Villegas, R., Saito, J., Yang, J., Zhou, Y., Black, M.J.: Stochastic scene-aware motion prediction. In: Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision. pp. 11374–11384 (2021)
2021
-
[27]
In: Proceedings of the IEEE/CVF inter- national conference on computer vision
Hassan, M., Choutas, V., Tzionas, D., Black, M.J.: Resolving 3d human pose ambiguities with 3d scene constraints. In: Proceedings of the IEEE/CVF inter- national conference on computer vision. pp. 2282–2292 (2019)
2019
-
[28]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Hassan,M.,Ghosh,P.,Tesch,J.,Tzionas,D.,Black,M.J.:Populating3dscenesby learning human-scene interaction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14708–14718 (2021)
2021
-
[29]
In: ACM SIGGRAPH 2023 Conference Proceedings
Hassan, M., Guo, Y., Wang, T., Black, M., Fidler, S., Peng, X.B.: Synthesiz- ing physical character-scene interactions. In: ACM SIGGRAPH 2023 Conference Proceedings. pp. 1–9 (2023) 18 H. Hsu, T. Cheng et al
2023
-
[30]
In: Proceedings of the Com- puter Vision and Pattern Recognition Conference
Hong, F., Guzov, V., Kim, H.J., Ye, Y., Newcombe, R., Liu, Z., Ma, L.: Egolm: Multi-modal language model of egocentric motions. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference. pp. 5344–5354 (2025)
2025
-
[31]
In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Huang, S., Wang, Z., Li, P., Jia, B., Liu, T., Zhu, Y., Liang, W., Zhu, S.C.: Diffusion-based generation, optimization, and planning in 3d scenes. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16750–16761 (2023)
2023
-
[32]
ACM Transactions on Graphics (TOG)37(6), 1–15 (2018)
Huang, Y., Kaufmann, M., Aksan, E., Black, M.J., Hilliges, O., Pons-Moll, G.: Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time. ACM Transactions on Graphics (TOG)37(6), 1–15 (2018)
2018
-
[33]
arXiv preprint arXiv:2503.16289 (2025)
Hwang, I., Zhou, B., Kim, Y.M., Wang, J., Guo, C.: Scenemi: Mo- tion in-betweening for modeling human-scene interactions. arXiv preprint arXiv:2503.16289 (2025)
-
[34]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Es- mail,A.,Equi,M.,Finn,C.,Fusai,N.,etal.:π0.5:avision-language-actionmodel with open-world generalization, 2025. URL https://arxiv. org/abs/2504.16054 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Advances in Neural Information Processing Systems36, 20067–20079 (2023)
Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems36, 20067–20079 (2023)
2023
-
[36]
In: Euro- pean conference on computer vision
Jiang, J., Streli, P., Qiu, H., Fender, A., Laich, L., Snape, P., Holz, C.: Avatar- poser: Articulated full-body pose tracking from sparse motion sensing. In: Euro- pean conference on computer vision. pp. 443–460. Springer (2022)
2022
-
[37]
In: SIGGRAPH Asia 2024 Conference Papers
Jiang, N., He, Z., Wang, Z., Li, H., Chen, Y., Huang, S., Zhu, Y.: Autonomous character-scene interaction synthesis from text instruction. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)
2024
-
[38]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Jiang, N., Zhang, Z., Li, H., Ma, X., Wang, Z., Chen, Y., Liu, T., Zhu, Y., Huang, S.: Scaling up dynamic human-scene interaction modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1737– 1747 (2024)
2024
-
[39]
In: SIGGRAPH Asia 2022 Conference Papers
Jiang, Y., Ye, Y., Gopinath, D., Won, J., Winkler, A.W., Liu, C.K.: Transformer inertial poser: Real-time human motion reconstruction from sparse imus with simultaneous terrain generation. In: SIGGRAPH Asia 2022 Conference Papers. pp. 1–9 (2022)
2022
-
[40]
In: Proceed- ings of the Computer Vision and Pattern Recognition Conference
Kim, J., Kim, J., Na, J., Joo, H.: Parahome: Parameterizing everyday home activ- ities towards 3d generative modeling of human-object interactions. In: Proceed- ings of the Computer Vision and Pattern Recognition Conference. pp. 1816–1828 (2025)
2025
-
[41]
OpenVLA: An Open-Source Vision-Language-Action Model
Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)
work page internal anchor Pith review arXiv 2024
-
[42]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Kulkarni,N.,Rempe,D.,Genova,K.,Kundu,A.,Johnson,J.,Fouhey,D.,Guibas, L.: Nifty: Neural object interaction fields for guided human motion synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 947–957 (2024)
2024
-
[43]
In: Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision
Lee, J., Joo, H.: Locomotion-action-manipulation: Synthesizing human-scene in- teractions in complex 3d environments. In: Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision. pp. 9663–9674 (2023) Seeing Without Eyes: 4D Human–Scene Understanding from Wearable IMUs 19
2023
-
[44]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Lee, J., Joo, H.: Mocap everyone everywhere: Lightweight motion capture with smartwatches and a head-mounted camera. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1091–1100 (2024)
2024
-
[45]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Li, B., Zhao, Y., Zhelun, S., Sheng, L.: Danceformer: Music conditioned 3d dance generation with parametric motion transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 1272–1279 (2022)
2022
-
[46]
In: European Conference on Computer Vision
Li, J., Clegg, A., Mottaghi, R., Wu, J., Puig, X., Liu, C.K.: Controllable human- object interaction synthesis. In: European Conference on Computer Vision. pp. 54–72. Springer (2024)
2024
-
[47]
ACM Transactions on Graphics (TOG)42(6), 1–11 (2023)
Li, J., Wu, J., Liu, C.K.: Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG)42(6), 1–11 (2023)
2023
-
[48]
In: ACM SIGGRAPH 2024 Conference Papers
Li, J., Huang, T., Zhu, Q., Wong, T.T.: Physics-based scene layout generation from human motion. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–10 (2024)
2024
-
[49]
Genmo: A generalist model for human motion.arXiv preprint arXiv: 2505.01425,
Li, J., Cao, J., Zhang, H., Rempe, D., Kautz, J., Iqbal, U., Yuan, Y.: Genmo: A generalist model for human motion. arXiv preprint arXiv:2505.01425 (2025)
-
[50]
In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision
Li,R.,Zhao,J.,Zhang,Y.,Su,M.,Ren,Z.,Zhang,H.,Tang,Y.,Li,X.:Finedance: A fine-grained choreography dataset for 3d full body dance generation. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10234–10243 (2023)
2023
-
[51]
Li, Y., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., Lee, Y.T.: Text- booksareallyouneedii:phi-1.5technicalreport.arXivpreprintarXiv:2309.05463 (2023)
work page internal anchor Pith review arXiv 2023
-
[52]
International Journal of Computer Vision132(9), 3463–3483 (2024)
Liang, H., Zhang, W., Li, W., Yu, J., Xu, L.: Intergen: Diffusion-based multi- human motion generation under complex interactions. International Journal of Computer Vision132(9), 3463–3483 (2024)
2024
-
[53]
In: Text summarization branches out
Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74–81 (2004)
2004
-
[54]
Advances in Neural Information Processing Systems36, 25268–25280 (2023)
Lin, J., Zeng, A., Lu, S., Cai, Y., Zhang, R., Wang, H., Zhang, L.: Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems36, 25268–25280 (2023)
2023
-
[55]
World model on million-length video and language with ringattention
Liu, H., Yan, W., Zaharia, M., Abbeel, P.: World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268 (2024)
-
[56]
In: NeurIPS (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
2023
-
[57]
Lu, J., Huang, C.H.P., Bhattacharya, U., Huang, Q., Zhou, Y.: Humoto: A 4d dataset of mocap human object interactions. arXiv preprint arXiv:2504.10414 (2025)
-
[58]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Lu, S., Wang, J., Lu, Z., Chen, L.H., Dai, W., Dong, J., Dou, Z., Dai, B., Zhang, R.: Scamo: Exploring the scaling law in autoregressive motion generation model. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27872–27882 (2025)
2025
-
[59]
In: Proceedings of the IEEE/CVF international conference on computer vision
Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: Archive of motion capture as surface shapes. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 5442–5451 (2019)
2019
-
[60]
ACM Transactions on Graphics (TOG)39(4), 39–1 (2020) 20 H
Merel, J., Tunyasuvunakool, S., Ahuja, A., Tassa, Y., Hasenclever, L., Pham, V., Erez, T., Wayne, G., Heess, N.: Catch & carry: reusable neural controllers for vision-guided whole-body tasks. ACM Transactions on Graphics (TOG)39(4), 39–1 (2020) 20 H. Hsu, T. Cheng et al
2020
-
[61]
In: 2024 International Conference on 3D Vision (3DV)
Mir, A., Puig, X., Kanazawa, A., Pons-Moll, G.: Generating continual human mo- tion in diverse 3d scenes. In: 2024 International Conference on 3D Vision (3DV). pp. 903–913. IEEE (2024)
2024
-
[62]
Computer vision and image understanding81(3), 231–268 (2001)
Moeslund, T.B., Granum, E.: A survey of computer vision-based human motion capture. Computer vision and image understanding81(3), 231–268 (2001)
2001
-
[63]
In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems
Mollyn, V., Arakawa, R., Goel, M., Harrison, C., Ahuja, K.: Imuposer: Full-body pose estimation using imus in phones, watches, and earbuds. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. pp. 1–12 (2023)
2023
-
[64]
In: European Conference on Computer Vision
Nie, Y., Dai, A., Han, X., Nießner, M.: Pose2room: understanding 3d scenes from human activities. In: European Conference on Computer Vision. pp. 425–443. Springer (2022)
2022
-
[65]
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
2002
-
[66]
In: Proceedings of the IEEE/CVF International Conference on Com- puter Vision
Patel,C.,Nakamura,H.,Kyuragi,Y.,Kozuka,K.,Niebles,J.C.,Adeli,E.:Uniego- motion: A unified model for egocentric motion reconstruction, forecasting, and generation. In: Proceedings of the IEEE/CVF International Conference on Com- puter Vision. pp. 10318–10329 (2025)
2025
-
[67]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10975–10985 (2019)
2019
-
[68]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)
work page internal anchor Pith review arXiv 2025
-
[69]
In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision
Petrov, I.A., Marin, R., Chibane, J., Pons-Moll, G.: Tridi: Trilateral diffusion of 3d humans, objects, and interactions. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision. pp. 5523–5535 (2025)
2025
-
[70]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Puig, X., Ra, K., Boben, M., Li, J., Wang, T., Fidler, S., Torralba, A.: Virtual- home: Simulating household activities via programs. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8494–8502 (2018)
2018
-
[71]
In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition
Punnakkal, A.R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, A., Black, M.J.: Babel: Bodies, action and behavior with english labels. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 722–731 (2021)
2021
-
[72]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2545–2555 (2025)
2025
-
[73]
OpenAI blog1(8), 9 (2019)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Lan- guage models are unsupervised multitask learners. OpenAI blog1(8), 9 (2019)
2019
-
[74]
Xsens Technol1(8) (2018)
Schepers, M., Giuberti, M., Bellusci, G., et al.: Xsens mvn: Consistent tracking of human motion using inertial sensing. Xsens Technol1(8) (2018)
2018
-
[75]
arXiv preprint arXiv:2503.14919 (2025)
Shi, J., Liu, L., Sun, Y., Zhang, Z., Zhou, J., Nie, Q.: Genm3: Generative pre- trained multi-path motion model for text conditional human motion generation. arXiv preprint arXiv:2503.14919 (2025)
-
[76]
sony.net/Products/mocopi-dev/en/, accessed: November 2025
Sony Corporation: mocopi - mobile motion capture system (2025),https://www. sony.net/Products/mocopi-dev/en/, accessed: November 2025
2025
-
[77]
Talukder, S., Yue, Y., Gkioxari, G.: Totem: Tokenized time series embeddings for general time series analysis. arXiv preprint arXiv:2402.16412 (2024) Seeing Without Eyes: 4D Human–Scene Understanding from Wearable IMUs 21
-
[78]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Team, C.: Chameleon: Mixed-modal early-fusion foundation models, 2024. URL https://arxiv. org/abs/2405.098189(8) (2024)
work page internal anchor Pith review arXiv 2024
-
[79]
In: ISMIR
Tsuchida, S., Fukayama, S., Hamasaki, M., Goto, M.: Aist dance video database: Multi-genre, multi-dancer, and multi-camera database for dance information pro- cessing. In: ISMIR. vol. 1, p. 6 (2019)
2019
-
[80]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Van Wouwe, T., Lee, S., Falisse, A., Delp, S., Liu, C.K.: Diffusionposer: Real-time human motion reconstruction from arbitrary sparse sensors using autoregressive diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2513–2523 (2024)
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.