Recognition: 2 Lean theorem links
Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
Pith reviewed 2026-05-15 12:15 UTC · model grok-4.3
The pith
Sparse 3D hand joints with occlusion-aware weighting generate controllable egocentric videos from one reference frame.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Leveraging sparse 3D hand joints as control signals, the framework extracts occlusion-aware features from the reference frame by penalizing hidden joints, employs a 3D-based weighting mechanism to handle dynamically occluded target joints, and directly injects 3D geometric embeddings into the latent space to enforce structural consistency. The result is high-fidelity egocentric video with realistic interactions and cross-embodiment generalization.
What carries the argument
The occlusion-aware control module that penalizes unreliable visual signals from occluded joints, applies 3D weighting for motion propagation, and injects geometric embeddings into the latent space.
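A minimal sketch can make this module concrete. The function signature, tensor shapes, and the specific visibility-based penalty below are illustrative assumptions, not the paper's actual formulation, which only the full text specifies.

```python
# Illustrative sketch of an occlusion-aware control module (assumed names and shapes).
import numpy as np

def occlusion_aware_control(ref_feats, ref_visibility, target_mask, target_depth,
                            lam=1.0, eps=1e-6):
    """Turn sparse-joint signals into a dense control map for the generator.

    ref_feats:      (J, C)    per-joint features from the reference frame
    ref_visibility: (J,)      in [0, 1]; low for joints hidden in the reference frame
    target_mask:    (J, H*W)  soft spatial support of each joint in the target frame
    target_depth:   (J,)      per-joint depth in the target frame (the 3D cue)
    """
    # 1) Penalize unreliable visual signal from joints hidden in the reference frame.
    penalized_feats = ref_feats * ref_visibility[:, None]                 # (J, C)

    # 2) 3D-based weighting over joints at each target location: combine the soft
    #    mask with a depth term, then normalize with a softmax over joints.
    logits = np.log(target_mask + eps) + lam * target_depth[:, None]     # (J, H*W)
    weights = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)

    # 3) Propagate the penalized features to target locations; in the paper this map
    #    would be injected into the latent space alongside 3D geometric embeddings.
    return weights.T @ penalized_feats                                    # (H*W, C)
```

Step 2 mirrors the softmax-over-joints weighting quoted later in the theorem-link section; everything else is a placeholder for details the summary does not pin down.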
If this is right
- Enables fine-grained 3D-consistent hand articulation in generated egocentric videos.
- Supports generalization from human to robotic hand embodiments without retraining.
- Reduces hallucinated artifacts in regions with severe self-occlusion.
- Provides an automated pipeline for creating large-scale paired video-trajectory datasets.
Where Pith is reading between the lines
- The same sparse-joint injection approach could be tested on full-body egocentric motion by extending the control module to additional keypoints.
- Longer video sequences might require an explicit temporal consistency loss on the 3D embeddings to maintain coherence beyond short clips.
- The occlusion penalization could be applied to other camera viewpoints, such as third-person views, to check if the 3D structure remains the dominant signal.
Load-bearing premise
Sparse 3D hand joints plus the occlusion-aware weighting supply enough geometric and semantic information to prevent motion inconsistencies without additional human-centric priors.
What would settle it
Generate a video sequence where hand joints are heavily occluded in the reference frame; if the output shows inconsistent finger articulation or 3D depth errors compared to ground-truth trajectories, the claim fails.
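This test can be phrased as a small, hedged evaluation script. The MPJPE-style metric, the visibility thresholds, and the 2 cm tolerance are assumptions chosen for illustration; the paper may settle the question with different criteria.

```python
# Sketch of the falsification test: compare generated hand-joint trajectories against
# ground truth, stratified by how occluded the reference frame was (assumed metric/thresholds).
import numpy as np

def mean_joint_error(pred, gt):
    """pred, gt: (N, T, J, 3) joint trajectories in meters; mean per-joint 3D error."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def occlusion_stratified_check(pred, gt, ref_visibility, tol_m=0.02):
    """ref_visibility: (N,) fraction of joints visible in each sequence's reference frame."""
    heavy = ref_visibility < 0.3   # heavily occluded reference frames
    light = ref_visibility > 0.7   # mostly visible reference frames
    err_heavy = mean_joint_error(pred[heavy], gt[heavy])
    err_light = mean_joint_error(pred[light], gt[light])
    # The claim is in trouble if heavy occlusion blows up the error relative to the
    # easy case or pushes it past an absolute tolerance.
    passed = err_heavy <= max(2.0 * err_light, tol_m)
    return passed, err_heavy, err_light
```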
read the original abstract
Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent fine-grained hand articulation. By relying on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over-rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts, and prevents cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structures. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and employs a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. Concurrently, the module directly injects 3D geometric embeddings into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. Additionally, we register humanoid kinematic and camera data to construct a cross-embodiment benchmark. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce a framework for generating controllable egocentric videos from a single reference frame by using sparse 3D hand joints as embodiment-agnostic control signals. It proposes an occlusion-aware control module that penalizes unreliable visual signals from hidden joints, applies 3D-based weighting during motion propagation, and injects 3D geometric embeddings into the latent space. The work also presents an automated pipeline yielding over one million annotated egocentric video clips and a cross-embodiment benchmark by registering humanoid kinematic data, with experimental results asserting significant outperformance over state-of-the-art baselines in fidelity, realistic interactions, and generalization to robotic hands.
Significance. If the central claims hold, the work would advance motion-controllable video generation for egocentric settings in VR and embodied AI by reducing reliance on 2D trajectories or human-centric priors. The large-scale dataset and cross-embodiment benchmark could serve as useful resources for future evaluation, provided they include reproducible baselines and metrics.
major comments (3)
- §3.2: The occlusion-aware feature extraction and 3D-based weighting mechanism are described at a high level, but no explicit equations or pseudocode detail how the penalization of hidden joints is computed or how the weighting is applied during propagation; without this, it is difficult to verify whether the module supplies sufficient geometric constraints to prevent the diffusion backbone from defaulting to learned priors under severe egocentric occlusions.
- §5.3, Table 3: The reported outperformance on the cross-embodiment benchmark for robotic hands is presented without an ablation isolating the contribution of the sparse 3D joint representation versus the occlusion module; the quantitative gains could be confounded by differences in training data distribution rather than the claimed 3D consistency enforcement.
- §4.1: The assumption that sparse 3D joints plus reference-frame feature extraction resolve self-occlusion and out-of-frame cases is load-bearing for the high-fidelity interaction and generalization claims, yet the paper provides no failure-case analysis or comparison against methods that incorporate additional human-centric priors to test this directly.
minor comments (2)
- Abstract: The abstract and introduction use the phrase 'exceptional cross-embodiment generalization' without defining the metric or threshold used to support this adjective.
- Figure 4: The caption refers to 'qualitative results' but does not specify the exact input conditions (e.g., degree of occlusion) for each row, reducing reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where revisions are needed for clarity or additional analysis, we will incorporate them in the revised manuscript.
read point-by-point responses
-
Referee: §3.2: The occlusion-aware feature extraction and 3D-based weighting mechanism are described at a high level, but no explicit equations or pseudocode detail how the penalization of hidden joints is computed or how the weighting is applied during propagation; without this, it is difficult to verify whether the module supplies sufficient geometric constraints to prevent the diffusion backbone from defaulting to learned priors under severe egocentric occlusions.
Authors: We agree that the description in §3.2 would benefit from greater mathematical precision. In the revised manuscript we will add explicit equations defining the occlusion penalization term applied to hidden joints during feature extraction, the 3D-based weighting function used in motion propagation, and the injection of geometric embeddings into the latent space. We will also include pseudocode for the full control module to allow direct verification that the geometric constraints are sufficient to mitigate reliance on learned priors under egocentric occlusion. revision: yes
-
Referee: §5.3, Table 3: The reported outperformance on the cross-embodiment benchmark for robotic hands is presented without an ablation isolating the contribution of the sparse 3D joint representation versus the occlusion module; the quantitative gains could be confounded by differences in training data distribution rather than the claimed 3D consistency enforcement.
Authors: We acknowledge that an explicit ablation isolating the occlusion module on the robotic-hand benchmark would strengthen the claims. While the sparse 3D representation itself is embodiment-agnostic and central to cross-embodiment generalization, we will add a controlled ablation in the revision that trains variants with and without the occlusion-aware components on identical data distributions and reports results on the same robotic-hand test set. This will clarify the incremental contribution of the occlusion handling. revision: yes
-
Referee: §4.1: The assumption that sparse 3D joints plus reference-frame feature extraction resolve self-occlusion and out-of-frame cases is load-bearing for the high-fidelity interaction and generalization claims, yet the paper provides no failure-case analysis or comparison against methods that incorporate additional human-centric priors to test this directly.
Authors: We agree that a dedicated failure-case analysis would provide stronger evidence for the load-bearing assumption. Although the current experiments include challenging egocentric sequences, we will add a new subsection and accompanying figure in the revision that systematically examines failure modes for severe self-occlusion and out-of-frame hands. We will also include direct comparisons against representative baselines that rely on additional human-centric priors to highlight where the sparse 3D approach succeeds or remains limited. revision: yes
Circularity Check
No circularity: new control module and dataset are independent contributions
full rationale
The paper presents a novel framework that extracts occlusion-aware features from sparse 3D hand joints and injects 3D geometric embeddings into a diffusion backbone. No equations, derivations, or self-citations are shown that reduce the claimed 3D consistency, high-fidelity interactions, or cross-embodiment generalization to quantities defined by the method's own fitted parameters or prior self-referential results. The automated annotation pipeline and registered humanoid benchmark are new data contributions, and performance claims rest on empirical comparisons to external baselines rather than any self-definitional loop or fitted-input-as-prediction pattern. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Sparse 3D hand joints provide embodiment-agnostic control signals with clear semantic and geometric structures sufficient to resolve occlusion ambiguities.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Tagged unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: 3D-based weighting mechanism... $A_{i,t}(x) = \operatorname{softmax}_i\big(\log(M_{i,t}(x) + \epsilon) + \lambda \cdot d_{i,t}\big)$... 3D geometric embeddings $z_{i,t} = \phi\big([\gamma(u_{i,t}, d_{i,t});\, E_{\mathrm{id}}[i]]\big)$
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Tagged unclear: relation between the paper passage and the cited Recognition theorem.
occlusion-aware motion feature... penalizing unreliable visual signals from hidden joints
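The weighting formula quoted in the first entry above can be sanity-checked with a tiny numeric example; the mask values, depths, and λ below are invented for illustration and are not taken from the paper.

```python
# Numeric illustration of A_{i,t}(x) = softmax_i(log(M_{i,t}(x) + eps) + lam * d_{i,t})
# at one spatial location x for three joints (all numbers made up).
import numpy as np

M = np.array([0.70, 0.05, 0.25])   # soft mask values M_{i,t}(x); the second joint is nearly hidden at x
d = np.array([0.40, 0.55, 0.35])   # joint depths d_{i,t} in meters
lam, eps = 1.0, 1e-6

logits = np.log(M + eps) + lam * d
A = np.exp(logits - logits.max())
A /= A.sum()                       # softmax over joints i
print(A.round(3))                  # ~[0.703, 0.058, 0.239]: the hidden joint gets little weight
```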
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.