Recognition: unknown
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
Pith reviewed 2026-05-10 03:11 UTC · model grok-4.3
The pith
CoInteract generates human-object interaction videos with reduced hand-object interpenetration by training an auxiliary geometry stream whose supervision regularizes the diffusion backbone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoInteract is an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. It introduces two components: a Human-Aware Mixture-of-Experts, which routes tokens to lightweight, region-specialized experts via spatially supervised routing, and Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream. During training the HOI stream attends to RGB tokens and its supervision regularizes the shared backbone weights; at inference the HOI branch is removed for zero-overhead RGB generation.
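The abstract leaves both mechanisms at the level of description. Purely as an illustration, here is a minimal PyTorch-style sketch of what spatially supervised routing to region-specialized experts could look like; every name here (HumanAwareMoE, the four-region labeling, the dense mixture) is an assumption of this sketch, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HumanAwareMoE(nn.Module):
    """Hypothetical sketch: route each token through a mixture of
    lightweight, region-specialized experts, with the router supervised
    by per-token spatial labels (e.g., hand / face / body / background)."""

    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 256):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # token -> expert logits
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, tokens, region_labels=None):
        # tokens: (B, N, D); region_labels: (B, N) ints in [0, num_experts)
        logits = self.router(tokens)                           # (B, N, E)
        weights = F.softmax(logits, dim=-1)
        # Dense mixture for simplicity; a real system would likely use
        # sparse top-1 dispatch to keep parameter and compute overhead low.
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=-1)
        out = (expert_out * weights.unsqueeze(2)).sum(dim=-1)  # (B, N, D)
        # "Spatially supervised routing": penalize the router when a token's
        # expert assignment disagrees with its region label (training only).
        route_loss = None
        if region_labels is not None:
            route_loss = F.cross_entropy(logits.flatten(0, 1), region_labels.flatten())
        return out, route_loss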
What carries the argument
Spatially-Structured Co-Generation, the dual-stream training paradigm in which an auxiliary HOI structure stream attends to RGB tokens and supplies interaction geometry priors that update the shared Diffusion Transformer weights.
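Again as an illustration only, a minimal sketch of the dual-stream coupling, assuming one-directional cross-attention from HOI tokens to RGB tokens and a single set of shared weights; CoGenerationBlock and its internals are hypothetical, not the paper's code.

import torch.nn as nn

class CoGenerationBlock(nn.Module):
    """Hypothetical sketch of one dual-stream block: the auxiliary HOI
    structure tokens cross-attend to RGB tokens, and both streams pass
    through the same shared layer, so the HOI loss backpropagates into
    the shared weights. At inference only the RGB path is executed."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.shared = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.hoi_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, hoi_tokens=None):
        rgb = self.shared(rgb_tokens)            # shared backbone weights
        if hoi_tokens is None:                   # inference: HOI branch removed
            return rgb, None
        # HOI stream queries the RGB stream (one-directional attention).
        ctx, _ = self.hoi_to_rgb(hoi_tokens, rgb, rgb)
        hoi = self.shared(hoi_tokens + ctx)      # same shared weights
        return rgb, hoi

# Training (sketch): a combined loss such as
#   loss = mse(rgb_out, rgb_target) + lam * mse(hoi_out, hoi_target)
# updates self.shared from both streams; at inference, calling
# block(rgb_tokens) skips the HOI branch, matching the zero-overhead claim.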
If this is right
- Generated videos exhibit fewer hand-object interpenetrations and more stable hand and face geometry.
- Logical consistency of interactions improves because the structure stream enforces plausible contact during training.
- Inference incurs no added compute because the auxiliary stream is discarded after training.
- The same backbone can be conditioned on person images, product images, text, and audio without retraining separate modules.
- The method scales to longer clips by maintaining the same dual-stream regularization at training time.
Where Pith is reading between the lines
- The same dual-stream pattern could be applied to other conditional video tasks such as human-scene interaction or multi-person dynamics to inject additional geometric priors without inference cost.
- Region-specialized MoE routing might reduce artifacts in other fine-grained synthesis domains like facial animation or garment deformation if spatial supervision is available.
- Removing the structure branch at inference opens the possibility of distilling the learned priors into smaller student models for real-time applications.
Load-bearing premise
The auxiliary HOI structure stream transfers physically plausible contact priors into the shared backbone weights so that the RGB-only model produces zero-interpenetration outputs once the structure branch is removed.
What would settle it
Run CoInteract and a baseline diffusion model on the same set of HOI prompts with visible hand-object contact; count frames containing hand-object interpenetration or unstable hand poses. If the counts are statistically indistinguishable, the transfer of priors has not occurred.
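To make that decision procedure concrete, a sketch under stated assumptions: suppose a per-frame boolean detector flags hand-object interpenetration (how the flags are produced, e.g. mesh intersection tests on recovered hand and object geometry, is itself an assumption), and compare the two methods' counts with a two-proportion z-test. The counts below are placeholders, not results.

import math

def interpenetration_rate(flags):
    """flags: per-frame booleans, True if the frame shows hand-object
    interpenetration or an unstable hand pose."""
    return sum(flags) / len(flags)

def two_proportion_ztest(k1, n1, k2, n2):
    """Two-sided z-test of whether two flagged-frame counts (k out of n)
    are consistent with the same underlying rate."""
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (k1 / n1 - k2 / n2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under N(0, 1)
    return z, p

# Placeholder counts: flagged frames out of frames checked, per method.
z, p = two_proportion_ztest(k1=37, n1=2000, k2=121, n2=2000)
print(f"z = {z:.2f}, p = {p:.3g}")
# A large p (counts statistically indistinguishable) would mean the claimed
# transfer of contact priors into the backbone has not been demonstrated.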
Original abstract
Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CoInteract, a framework for human-object interaction (HOI) video synthesis conditioned on a person reference image, product reference image, text prompts, and speech audio. It augments a Diffusion Transformer (DiT) backbone with two designs: a Human-Aware Mixture-of-Experts (MoE) that uses spatially supervised routing to region-specialized experts for improved structural fidelity of hands and faces, and Spatially-Structured Co-Generation, a dual-stream training paradigm in which an auxiliary HOI structure stream attends to RGB tokens and provides supervision that regularizes the shared backbone weights; the HOI branch is dropped at inference for zero-overhead RGB generation. The central claim is that these components yield videos with superior structural stability, logical consistency, and physically plausible contacts (e.g., zero interpenetration) compared with prior diffusion-based HOI methods.
Significance. If the performance claims hold, the work would provide a practical, inference-efficient route to physically consistent HOI video generation that is directly relevant to e-commerce and virtual marketing. The co-generation idea—using an auxiliary geometry stream only at training time to embed contact priors—is conceptually attractive and could generalize beyond the specific backbone. The manuscript does not, however, supply any quantitative support for these benefits.
Major comments (3)
- [Abstract] Abstract: the claim that CoInteract 'significantly outperforms existing methods in structural stability, logical consistency, and interaction realism' is unsupported; the manuscript contains no quantitative metrics (FID, contact error, interpenetration rate, user-study scores), no baseline comparisons, and no ablation tables.
- [Method (Spatially-Structured Co-Generation)] Spatially-Structured Co-Generation description: the assertion that auxiliary HOI-stream supervision successfully transfers contact priors into the shared DiT weights (enabling zero-interpenetration RGB outputs once the structure branch is removed) is a load-bearing assumption for the central claim, yet no weight/feature analysis, no ablation isolating the co-generation effect on contact metrics, and no comparison of interpenetration rates with/without the auxiliary stream are provided.
- [Experiments] Experimental results section: absence of any tables or figures reporting performance numbers, ablation studies on the MoE routing or dual-stream coupling, or error analysis leaves the outperformance claim unverifiable and prevents assessment of whether the regularization actually improves physical plausibility rather than merely RGB reconstruction.
Minor comments (1)
- [Method] Notation for the MoE routing function and the attention between RGB and HOI tokens could be formalized with explicit equations to improve reproducibility.
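For illustration, one plausible way such equations could read, with all notation introduced here rather than taken from the manuscript:

% Spatially supervised routing (hypothetical notation): token x_i receives
% expert weights from a router W_r, trained against its region label r_i.
\[
  g_i = \mathrm{softmax}(W_r x_i), \qquad
  y_i = \sum_{e=1}^{E} g_{i,e}\, f_e(x_i), \qquad
  \mathcal{L}_{\mathrm{route}} = -\sum_i \log g_{i, r_i}.
\]
% Cross-stream attention (hypothetical notation): HOI structure tokens H
% query RGB tokens X, so HOI supervision flows through the shared weights.
\[
  \mathrm{Attn}(H, X) = \mathrm{softmax}\!\left(\frac{(H W_Q)(X W_K)^{\top}}{\sqrt{d}}\right) X W_V
\]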
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the current manuscript version lacks the quantitative evaluations, ablations, and analyses needed to substantiate the performance claims, and we will revise the paper accordingly to address all points raised.
Point-by-point responses
Referee: [Abstract] Abstract: the claim that CoInteract 'significantly outperforms existing methods in structural stability, logical consistency, and interaction realism' is unsupported; the manuscript contains no quantitative metrics (FID, contact error, interpenetration rate, user-study scores), no baseline comparisons, and no ablation tables.
Authors: We agree that the abstract claim requires supporting quantitative evidence. The current manuscript presents only qualitative demonstrations. In the revised version, we will add a full experimental section containing FID scores, contact error, interpenetration rates, user-study scores, baseline comparisons, and ablation tables to rigorously support the outperformance claims. revision: yes
Referee: [Method (Spatially-Structured Co-Generation)] Spatially-Structured Co-Generation description: the assertion that auxiliary HOI-stream supervision successfully transfers contact priors into the shared DiT weights (enabling zero-interpenetration RGB outputs once the structure branch is removed) is a load-bearing assumption for the central claim, yet no weight/feature analysis, no ablation isolating the co-generation effect on contact metrics, and no comparison of interpenetration rates with/without the auxiliary stream are provided.
Authors: We acknowledge that the manuscript currently lacks direct evidence for the transfer of contact priors via co-generation. In the revision, we will include weight/feature visualizations or analyses demonstrating the effect on shared weights, an ablation isolating the auxiliary stream's impact on contact metrics, and explicit interpenetration rate comparisons with and without the co-generation component. revision: yes
Referee: [Experiments] Experimental results section: absence of any tables or figures reporting performance numbers, ablation studies on the MoE routing or dual-stream coupling, or error analysis leaves the outperformance claim unverifiable and prevents assessment of whether the regularization actually improves physical plausibility rather than merely RGB reconstruction.
Authors: We agree that the experimental results section is currently insufficient. The revised manuscript will expand this section with tables and figures reporting quantitative performance numbers, ablations on the MoE routing and dual-stream coupling, and error analysis to demonstrate improvements in physical plausibility beyond RGB reconstruction. revision: yes
Circularity Check
No circularity: architectural proposal with empirical claims
Full rationale
The paper describes a dual-stream training procedure (RGB + auxiliary HOI structure) that regularizes shared DiT weights during training before dropping the auxiliary branch at inference. This is presented as an engineering design choice whose benefit is measured by downstream metrics, not derived as a mathematical identity or fitted parameter renamed as prediction. No equations, self-citations, or uniqueness theorems are invoked in the provided text that would reduce the claimed physically-consistent outputs to the inputs by construction. The performance claims remain falsifiable via external benchmarks and ablations.