arxiv: 2604.21926 · v1 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

Alexander G. Schwing, Hao-Yu Hsu, Jing Wen, Shenlong Wang, Tianhang Cheng

Pith reviewed 2026-05-09 21:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords wearable IMU4D reconstructionhuman motionscene understandinglarge language modelsnon-visual perceptionspatiotemporal modeling

0 comments

The pith

Wearable inertial sensors enable 4D human motion and scene layout reconstruction by repurposing large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that rich 4D understanding of humans and their environments is achievable without any cameras or visual input, relying solely on data from a few wearable inertial sensors. It does so by introducing the IMU-to-4D framework that adapts large language models to translate sparse motion readings into detailed time-varying human poses and coarse 3D scene structures. A sympathetic reader would care because this sidesteps persistent camera drawbacks including privacy risks, high energy demands, and limited scalability for continuous monitoring. If correct, the work indicates that everyday devices already equipped with such sensors could deliver coherent spatiotemporal perception.

Core claim

IMU-to-4D is a framework that repurposes large language models to predict detailed 4D human motion together with coarse scene structure directly from signals of a few inertial sensors, yielding more coherent and temporally stable outputs than state-of-the-art cascaded pipelines across diverse human-scene datasets.

What carries the argument

The IMU-to-4D framework, which repurposes large language models to perform non-visual spatiotemporal reasoning that jointly infers human motion dynamics and rough 3D scene layouts from sparse IMU data.

If this is right

Wearable devices can deliver coherent 4D perception without relying on cameras or visual processing.
Motion data alone supports inference of coarse 3D scene layouts in addition to human poses.
Outputs maintain higher temporal stability than cascaded visual pipelines across varied datasets.
Applications become viable in privacy-sensitive or energy-constrained settings where cameras are unsuitable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend naturally to real-time operation on consumer-grade hardware if model inference costs are managed.
Combining IMU signals with minimal additional non-visual cues could resolve ambiguities in scene prediction.
Similar repurposing of language models might apply to inferring object interactions or human intent from motion patterns alone.

Load-bearing premise

Large language models can map sparse non-visual inertial signals to accurate detailed human motion and scene structure without visual supervision or explicit geometric priors.

What would settle it

A test set of IMU recordings from identical human motions performed inside structurally different scenes, with evaluation of whether the predicted scene layouts distinguish the environments correctly.

Figures

Figures reproduced from arXiv: 2604.21926 by Alexander G. Schwing, Hao-Yu Hsu, Jing Wen, Shenlong Wang, Tianhang Cheng.

**Figure 1.** Figure 1: IMU-to-4D reconstructs full 4D human–scene dynamics from only a few everyday wearable IMUs. Given sparse inertial signals from devices such as earbuds, smartphones, or watches, our method predicts SMPL-X body motion, generates textual activity descriptions, and recovers a coarse 3D scene layout with object identities. can everyday motion sensors embedded in earbuds, watches, and smartphones reveal about … view at source ↗

**Figure 2.** Figure 2: Overview of IMU-to-4D. (Left) Motion Tokenizer. The root trajectory is chunked into fixed-length windows and normalized, where normalized chunks are VQquantized and normalization parameters µ and σ are separately quantized via nonuniform binning, together yielding compact discrete root tokens. Body poses are continuously encoded via an MLP to produce body tokens. (Right) Multi-modal Transformer. Inerti… view at source ↗

**Figure 3.** Figure 3: Comparison of motion tokenizers and their integration with LLMs. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: IMU-to-Motion comparison on MotionMillions. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative results of scene prediction on Humoto. [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Ambiguity of scene prediction given IMU signals as inputs. [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative comparison on IMU-to-3D scene prediction. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: IMU-to-Motion and relocalization on ParaHome [ [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

read the original abstract

Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than SoTA cascaded pipelines, suggesting wearable motion sensors alone can support rich 4D understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces IMU-to-4D, a framework that repurposes large language models to reconstruct detailed 4D human motion together with coarse 3D scene layouts directly from sparse wearable IMU signals (earbuds, watches, smartphones), claiming that this yields more coherent and temporally stable results than state-of-the-art cascaded pipelines across diverse human-scene datasets.

Significance. If the empirical superiority holds with proper validation, the work would be significant for enabling privacy-preserving, low-energy 4D human-scene understanding without cameras. However, the absence of any quantitative metrics, ablation studies, or implementation details prevents assessment of whether the result actually demonstrates non-visual scene inference or merely reflects memorized correlations.

major comments (2)

[Abstract] Abstract: the central empirical claim that IMU-to-4D 'yields more coherent and temporally stable results than SoTA cascaded pipelines' supplies no quantitative metrics (e.g., no coherence scores, temporal stability measures, or numerical comparison values), no tables, and no ablation studies, so the superiority cannot be evaluated.
[Abstract] Abstract / Method description: IMU signals provide only local acceleration and angular velocity with no direct scene geometry; the manuscript gives no information on (a) IMU tokenization for the LLM, (b) whether scene supervision derives from visual ground truth or motion alone, or (c) any test-time ablation removing visual cues, which is required to attribute results to non-visual 4D understanding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We agree that the abstract and method sections would benefit from more explicit quantitative support and implementation details to allow readers to fully evaluate the claims. We will revise the manuscript to incorporate these elements.

read point-by-point responses

Referee: [Abstract] Abstract: the central empirical claim that IMU-to-4D 'yields more coherent and temporally stable results than SoTA cascaded pipelines' supplies no quantitative metrics (e.g., no coherence scores, temporal stability measures, or numerical comparison values), no tables, and no ablation studies, so the superiority cannot be evaluated.

Authors: We acknowledge that the abstract as currently written summarizes the empirical findings without specific numbers. The experiments section of the manuscript contains quantitative comparisons, tables, and ablation studies on coherence and temporal stability metrics across datasets. To address the concern directly, we will revise the abstract to include representative numerical results and explicit references to the supporting tables and ablations. revision: yes
Referee: [Abstract] Abstract / Method description: IMU signals provide only local acceleration and angular velocity with no direct scene geometry; the manuscript gives no information on (a) IMU tokenization for the LLM, (b) whether scene supervision derives from visual ground truth or motion alone, or (c) any test-time ablation removing visual cues, which is required to attribute results to non-visual 4D understanding.

Authors: We agree these details are necessary to substantiate the non-visual claim. We will expand the method section in the revision to fully describe (a) the IMU tokenization procedure used to prepare signals for the LLM, (b) the source of scene supervision during training, and (c) an ablation that isolates performance when visual cues are removed. This will allow clear attribution to IMU-based inference. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with no derivation chain

full rationale

The paper introduces IMU-to-4D as an empirical framework that repurposes LLMs to map wearable IMU signals to 4D human motion and coarse scene structure. All central claims rest on experimental comparisons across human-scene datasets showing improved coherence over cascaded pipelines. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described approach. The result is presented as data-driven evidence rather than a mathematical reduction to its own inputs, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, background axioms, or newly postulated entities; all technical details are absent.

pith-pipeline@v0.9.0 · 5450 in / 1024 out tokens · 28829 ms · 2026-05-09T21:55:53.182769+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

123 extracted references · 27 canonical work pages · 13 internal anchors

[1]

In: Proceedings of the 16 H

Araújo, J.P., Li, J., Vetrivel, K., Agarwal, R., Wu, J., Gopinath, D., Clegg, A.W., Liu, K.: Circle: Capture in rich contextual environments. In: Proceedings of the 16 H. Hsu, T. Cheng et al. IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21211– 21221 (2023)

2023
[2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Bhatnagar, B.L., Xie, X., Petrov, I.A., Sminchisescu, C., Theobalt, C., Pons- Moll, G.: Behave: Dataset and method for tracking human object interactions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15935–15946 (2022)

2022
[4]

In: 2024 International Conference on 3D Vision (3DV)

Braun, J., Christen, S., Kocabas, M., Aksan, E., Hilliges, O.: Physically plausible full-body hand-object interaction synthesis. In: 2024 International Conference on 3D Vision (3DV). pp. 464–473. IEEE (2024)

2024
[5]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Castillo, A., Escobar, M., Jeanneret, G., Pumarola, A., Arbeláez, P., Thabet, A., Sanakoyeu, A.: Bodiffusion: Diffusing sparse observations for full-body human motion synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4221–4231 (2023)

2023
[6]

Motionllm: Understanding human behaviors from human motions and videos.arXiv preprint arXiv:2405.20340, 2024

Chen, L.H., Lu, S., Zeng, A., Zhang, H., Wang, B., Zhang, R., Zhang, L.: Mo- tionllm: Understanding human behaviors from human motions and videos. arXiv preprint arXiv:2405.20340 (2024)

work page arXiv 2024
[7]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025)

work page internal anchor Pith review arXiv 2025
[8]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

work page internal anchor Pith review arXiv 2024
[9]

In: Proceedings of the IEEE/CVF international conference on computer vision

Chung, J., Wuu, C.h., Yang, H.r., Tai, Y.W., Tang, C.K.: Haa500: Human-centric atomic action dataset with curated videos. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 13465–13474 (2021)

2021
[10]

IEEE transactions on pattern analysis and machine intelligence (2024)

Delmas, G., Weinzaepfel, P., Lucas, T., Moreno-Noguer, F., Rogez, G.: Posescript: Linking 3d human poses and natural language. IEEE transactions on pattern analysis and machine intelligence (2024)

2024
[11]

Emerging Properties in Unified Multimodal Pretraining

Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

work page internal anchor Pith review arXiv 2025
[12]

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectionaltransformersforlanguageunderstanding.In:Proceedingsofthe2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). pp. 4171–4186 (2019)

2019
[13]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Dittadi, A., Dziadzio, S., Cosker, D., Lundell, B., Cashman, T.J., Shotton, J.: Full-body motion from a single head-mounted device: Generating smpl poses from partial observations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11687–11697 (2021)

2021
[14]

In: Proceedings of the IEEE/CVFconferenceoncomputervisionandpatternrecognition.pp.1323–1333 (2024)

Dwivedi, S.K., Sun, Y., Patel, P., Feng, Y., Black, M.J.: Tokenhmr: Advancing human mesh recovery with a tokenized pose representation. In: Proceedings of the IEEE/CVFconferenceoncomputervisionandpatternrecognition.pp.1323–1333 (2024)

2024
[15]

In: Seeing Without Eyes: 4D Human–Scene Understanding from Wearable IMUs 17 Proceedings of the IEEE/CVF International Conference on Computer Vision

Fan, K., Lu, S., Dai, M., Yu, R., Xiao, L., Dou, Z., Dong, J., Ma, L., Wang, J.: Go to zero: Towards zero-shot motion generation with million-scale data. In: Seeing Without Eyes: 4D Human–Scene Understanding from Wearable IMUs 17 Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13336–13348 (2025)

2025
[16]

In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition

Feng, Y., Lin, J., Dwivedi, S.K., Sun, Y., Patel, P., Black, M.J.: Chatpose: Chat- ting about 3d human pose. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 2093–2103 (2024)

2093
[17]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Fieraru, M., Zanfir, M., Oneata, E., Popa, A.I., Olaru, V., Sminchisescu, C.: Learning complex 3d human self-contact. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 1343–1351 (2021)

2021
[18]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Fieraru, M., Zanfir, M., Pirlea, S.C., Olaru, V., Sminchisescu, C.: Aifit: Automatic 3d human-interpretable feedback models for fitness training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9919– 9928 (2021)

2021
[19]

Seed-x: Multimodal models with unified multi-granularity comprehension and generation

Ge, Y., Zhao, S., Zhu, J., Ge, Y., Yi, K., Song, L., Li, C., Ding, X., Shan, Y.: Seed- x: Multimodal models with unified multi-granularity comprehension and genera- tion. arXiv preprint arXiv:2404.14396 (2024)

work page arXiv 2024
[20]

In: Computer Graphics Forum

Ghosh, A., Dabral, R., Golyanik, V., Theobalt, C., Slusallek, P.: Imos: Intent- driven full-body motion synthesis for human-object interactions. In: Computer Graphics Forum. vol. 42, pp. 1–12. Wiley Online Library (2023)

2023
[21]

In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Nov 2024)

Ghosh, S., Kumar, S., Seth, A., Evuru, C.K.R., Tyagi, U., Sakshi, S., Nieto, O., Duraiswami, R., Manocha, D.: GAMA: A large audio-language model with advanced audio understanding and complex reasoning abilities. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Nov 2024)

2024
[22]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., Afouras, T., Ashutosh, K., Baiyya, V., Bansal, S., Boote, B., et al.: Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19383–19400 (2024)

2024
[23]

Snapmogen: Human motion generation from expressive texts.arXiv preprint arXiv:2507.09122, 2025

Guo, C., Hwang, I., Wang, J., Zhou, B.: Snapmogen: Human motion generation from expressive texts. arXiv preprint arXiv:2507.09122 (2025)

work page arXiv 2025
[24]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d human motions from text. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5152–5161 (2022)

2022
[25]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Guzov, V., Mir, A., Sattler, T., Pons-Moll, G.: Human poseitioning system (hps): 3dhumanposeestimationandself-localizationinlarge scenes frombody-mounted sensors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4318–4329 (2021)

2021
[26]

In: Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision

Hassan, M., Ceylan, D., Villegas, R., Saito, J., Yang, J., Zhou, Y., Black, M.J.: Stochastic scene-aware motion prediction. In: Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision. pp. 11374–11384 (2021)

2021
[27]

In: Proceedings of the IEEE/CVF inter- national conference on computer vision

Hassan, M., Choutas, V., Tzionas, D., Black, M.J.: Resolving 3d human pose ambiguities with 3d scene constraints. In: Proceedings of the IEEE/CVF inter- national conference on computer vision. pp. 2282–2292 (2019)

2019
[28]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Hassan,M.,Ghosh,P.,Tesch,J.,Tzionas,D.,Black,M.J.:Populating3dscenesby learning human-scene interaction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14708–14718 (2021)

2021
[29]

In: ACM SIGGRAPH 2023 Conference Proceedings

Hassan, M., Guo, Y., Wang, T., Black, M., Fidler, S., Peng, X.B.: Synthesiz- ing physical character-scene interactions. In: ACM SIGGRAPH 2023 Conference Proceedings. pp. 1–9 (2023) 18 H. Hsu, T. Cheng et al

2023
[30]

In: Proceedings of the Com- puter Vision and Pattern Recognition Conference

Hong, F., Guzov, V., Kim, H.J., Ye, Y., Newcombe, R., Liu, Z., Ma, L.: Egolm: Multi-modal language model of egocentric motions. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference. pp. 5344–5354 (2025)

2025
[31]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Huang, S., Wang, Z., Li, P., Jia, B., Liu, T., Zhu, Y., Liang, W., Zhu, S.C.: Diffusion-based generation, optimization, and planning in 3d scenes. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16750–16761 (2023)

2023
[32]

ACM Transactions on Graphics (TOG)37(6), 1–15 (2018)

Huang, Y., Kaufmann, M., Aksan, E., Black, M.J., Hilliges, O., Pons-Moll, G.: Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time. ACM Transactions on Graphics (TOG)37(6), 1–15 (2018)

2018
[33]

arXiv preprint arXiv:2503.16289 (2025)

Hwang, I., Zhou, B., Kim, Y.M., Wang, J., Guo, C.: Scenemi: Mo- tion in-betweening for modeling human-scene interactions. arXiv preprint arXiv:2503.16289 (2025)

work page arXiv 2025
[34]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Es- mail,A.,Equi,M.,Finn,C.,Fusai,N.,etal.:π0.5:avision-language-actionmodel with open-world generalization, 2025. URL https://arxiv. org/abs/2504.16054 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Advances in Neural Information Processing Systems36, 20067–20079 (2023)

Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems36, 20067–20079 (2023)

2023
[36]

In: Euro- pean conference on computer vision

Jiang, J., Streli, P., Qiu, H., Fender, A., Laich, L., Snape, P., Holz, C.: Avatar- poser: Articulated full-body pose tracking from sparse motion sensing. In: Euro- pean conference on computer vision. pp. 443–460. Springer (2022)

2022
[37]

In: SIGGRAPH Asia 2024 Conference Papers

Jiang, N., He, Z., Wang, Z., Li, H., Chen, Y., Huang, S., Zhu, Y.: Autonomous character-scene interaction synthesis from text instruction. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

2024
[38]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Jiang, N., Zhang, Z., Li, H., Ma, X., Wang, Z., Chen, Y., Liu, T., Zhu, Y., Huang, S.: Scaling up dynamic human-scene interaction modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1737– 1747 (2024)

2024
[39]

In: SIGGRAPH Asia 2022 Conference Papers

Jiang, Y., Ye, Y., Gopinath, D., Won, J., Winkler, A.W., Liu, C.K.: Transformer inertial poser: Real-time human motion reconstruction from sparse imus with simultaneous terrain generation. In: SIGGRAPH Asia 2022 Conference Papers. pp. 1–9 (2022)

2022
[40]

In: Proceed- ings of the Computer Vision and Pattern Recognition Conference

Kim, J., Kim, J., Na, J., Joo, H.: Parahome: Parameterizing everyday home activ- ities towards 3d generative modeling of human-object interactions. In: Proceed- ings of the Computer Vision and Pattern Recognition Conference. pp. 1816–1828 (2025)

2025
[41]

OpenVLA: An Open-Source Vision-Language-Action Model

Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

work page internal anchor Pith review arXiv 2024
[42]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Kulkarni,N.,Rempe,D.,Genova,K.,Kundu,A.,Johnson,J.,Fouhey,D.,Guibas, L.: Nifty: Neural object interaction fields for guided human motion synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 947–957 (2024)

2024
[43]

In: Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision

Lee, J., Joo, H.: Locomotion-action-manipulation: Synthesizing human-scene in- teractions in complex 3d environments. In: Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision. pp. 9663–9674 (2023) Seeing Without Eyes: 4D Human–Scene Understanding from Wearable IMUs 19

2023
[44]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Lee, J., Joo, H.: Mocap everyone everywhere: Lightweight motion capture with smartwatches and a head-mounted camera. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1091–1100 (2024)

2024
[45]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Li, B., Zhao, Y., Zhelun, S., Sheng, L.: Danceformer: Music conditioned 3d dance generation with parametric motion transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 1272–1279 (2022)

2022
[46]

In: European Conference on Computer Vision

Li, J., Clegg, A., Mottaghi, R., Wu, J., Puig, X., Liu, C.K.: Controllable human- object interaction synthesis. In: European Conference on Computer Vision. pp. 54–72. Springer (2024)

2024
[47]

ACM Transactions on Graphics (TOG)42(6), 1–11 (2023)

Li, J., Wu, J., Liu, C.K.: Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG)42(6), 1–11 (2023)

2023
[48]

In: ACM SIGGRAPH 2024 Conference Papers

Li, J., Huang, T., Zhu, Q., Wong, T.T.: Physics-based scene layout generation from human motion. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–10 (2024)

2024
[49]

Genmo: A generalist model for human motion.arXiv preprint arXiv: 2505.01425,

Li, J., Cao, J., Zhang, H., Rempe, D., Kautz, J., Iqbal, U., Yuan, Y.: Genmo: A generalist model for human motion. arXiv preprint arXiv:2505.01425 (2025)

work page arXiv 2025
[50]

In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision

Li,R.,Zhao,J.,Zhang,Y.,Su,M.,Ren,Z.,Zhang,H.,Tang,Y.,Li,X.:Finedance: A fine-grained choreography dataset for 3d full body dance generation. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10234–10243 (2023)

2023
[51]

Li, Y., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., Lee, Y.T.: Text- booksareallyouneedii:phi-1.5technicalreport.arXivpreprintarXiv:2309.05463 (2023)

work page internal anchor Pith review arXiv 2023
[52]

International Journal of Computer Vision132(9), 3463–3483 (2024)

Liang, H., Zhang, W., Li, W., Yu, J., Xu, L.: Intergen: Diffusion-based multi- human motion generation under complex interactions. International Journal of Computer Vision132(9), 3463–3483 (2024)

2024
[53]

In: Text summarization branches out

Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74–81 (2004)

2004
[54]

Advances in Neural Information Processing Systems36, 25268–25280 (2023)

Lin, J., Zeng, A., Lu, S., Cai, Y., Zhang, R., Wang, H., Zhang, L.: Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems36, 25268–25280 (2023)

2023
[55]

World model on million-length video and language with ringattention

Liu, H., Yan, W., Zaharia, M., Abbeel, P.: World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268 (2024)

work page arXiv 2024
[56]

In: NeurIPS (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)

2023
[57]

48550/arXiv.2504.10414

Lu, J., Huang, C.H.P., Bhattacharya, U., Huang, Q., Zhou, Y.: Humoto: A 4d dataset of mocap human object interactions. arXiv preprint arXiv:2504.10414 (2025)

work page arXiv 2025
[58]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Lu, S., Wang, J., Lu, Z., Chen, L.H., Dai, W., Dong, J., Dou, Z., Dai, B., Zhang, R.: Scamo: Exploring the scaling law in autoregressive motion generation model. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27872–27882 (2025)

2025
[59]

In: Proceedings of the IEEE/CVF international conference on computer vision

Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: Archive of motion capture as surface shapes. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 5442–5451 (2019)

2019
[60]

ACM Transactions on Graphics (TOG)39(4), 39–1 (2020) 20 H

Merel, J., Tunyasuvunakool, S., Ahuja, A., Tassa, Y., Hasenclever, L., Pham, V., Erez, T., Wayne, G., Heess, N.: Catch & carry: reusable neural controllers for vision-guided whole-body tasks. ACM Transactions on Graphics (TOG)39(4), 39–1 (2020) 20 H. Hsu, T. Cheng et al

2020
[61]

In: 2024 International Conference on 3D Vision (3DV)

Mir, A., Puig, X., Kanazawa, A., Pons-Moll, G.: Generating continual human mo- tion in diverse 3d scenes. In: 2024 International Conference on 3D Vision (3DV). pp. 903–913. IEEE (2024)

2024
[62]

Computer vision and image understanding81(3), 231–268 (2001)

Moeslund, T.B., Granum, E.: A survey of computer vision-based human motion capture. Computer vision and image understanding81(3), 231–268 (2001)

2001
[63]

In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems

Mollyn, V., Arakawa, R., Goel, M., Harrison, C., Ahuja, K.: Imuposer: Full-body pose estimation using imus in phones, watches, and earbuds. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. pp. 1–12 (2023)

2023
[64]

In: European Conference on Computer Vision

Nie, Y., Dai, A., Han, X., Nießner, M.: Pose2room: understanding 3d scenes from human activities. In: European Conference on Computer Vision. pp. 425–443. Springer (2022)

2022
[65]

In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics

Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)

2002
[66]

In: Proceedings of the IEEE/CVF International Conference on Com- puter Vision

Patel,C.,Nakamura,H.,Kyuragi,Y.,Kozuka,K.,Niebles,J.C.,Adeli,E.:Uniego- motion: A unified model for egocentric motion reconstruction, forecasting, and generation. In: Proceedings of the IEEE/CVF International Conference on Com- puter Vision. pp. 10318–10329 (2025)

2025
[67]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10975–10985 (2019)

2019
[68]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)

work page internal anchor Pith review arXiv 2025
[69]

In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision

Petrov, I.A., Marin, R., Chibane, J., Pons-Moll, G.: Tridi: Trilateral diffusion of 3d humans, objects, and interactions. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision. pp. 5523–5535 (2025)

2025
[70]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Puig, X., Ra, K., Boben, M., Li, J., Wang, T., Fidler, S., Torralba, A.: Virtual- home: Simulating household activities via programs. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8494–8502 (2018)

2018
[71]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Punnakkal, A.R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, A., Black, M.J.: Babel: Bodies, action and behavior with english labels. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 722–731 (2021)

2021
[72]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2545–2555 (2025)

2025
[73]

OpenAI blog1(8), 9 (2019)

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Lan- guage models are unsupervised multitask learners. OpenAI blog1(8), 9 (2019)

2019
[74]

Xsens Technol1(8) (2018)

Schepers, M., Giuberti, M., Bellusci, G., et al.: Xsens mvn: Consistent tracking of human motion using inertial sensing. Xsens Technol1(8) (2018)

2018
[75]

arXiv preprint arXiv:2503.14919 (2025)

Shi, J., Liu, L., Sun, Y., Zhang, Z., Zhou, J., Nie, Q.: Genm3: Generative pre- trained multi-path motion model for text conditional human motion generation. arXiv preprint arXiv:2503.14919 (2025)

work page arXiv 2025
[76]

sony.net/Products/mocopi-dev/en/, accessed: November 2025

Sony Corporation: mocopi - mobile motion capture system (2025),https://www. sony.net/Products/mocopi-dev/en/, accessed: November 2025

2025
[77]

arXiv preprint arXiv:2402.16412 (2024) Seeing Without Eyes: 4D Human–Scene Understanding from Wearable IMUs 21

Talukder, S., Yue, Y., Gkioxari, G.: Totem: Tokenized time series embeddings for general time series analysis. arXiv preprint arXiv:2402.16412 (2024) Seeing Without Eyes: 4D Human–Scene Understanding from Wearable IMUs 21

work page arXiv 2024
[78]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Team, C.: Chameleon: Mixed-modal early-fusion foundation models, 2024. URL https://arxiv. org/abs/2405.098189(8) (2024)

work page internal anchor Pith review arXiv 2024
[79]

In: ISMIR

Tsuchida, S., Fukayama, S., Hamasaki, M., Goto, M.: Aist dance video database: Multi-genre, multi-dancer, and multi-camera database for dance information pro- cessing. In: ISMIR. vol. 1, p. 6 (2019)

2019
[80]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Van Wouwe, T., Lee, S., Falisse, A., Delp, S., Liu, C.K.: Diffusionposer: Real-time human motion reconstruction from arbitrary sparse sensors using autoregressive diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2513–2523 (2024)

2024

Showing first 80 references.