pith. sign in

arxiv: 2605.22144 · v1 · pith:JOGDCZKZnew · submitted 2026-05-21 · 💻 cs.CV

One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems

Pith reviewed 2026-05-22 06:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords short drama generationmulti-agent systemsvideo consistencynarrative pacingspatial consistencypersonalized video
0
0 comments X

The pith

A multi-agent system turns one user sentence into a paced and spatially consistent short drama video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that current one-shot LLM scripts and loose production pipelines cannot deliver the narrative pacing, spatial consistency, and quality control needed for short dramas. Weak hooks, drifting character positions, and endless manual fixes result from this gap. Their hierarchical multi-agent approach uses debate among agents to shape the story, 3D-grounded first frames to anchor scene layouts, and repeated reviewer loops to catch errors at every stage. If successful, this would let anyone create coherent personalized short videos directly from an idea without constant human oversight.

Core claim

The authors present a hierarchical multi-agent framework that converts a single-sentence idea into a fully produced short drama through three components: debate-based story generation that enforces pacing and coherence, a 3D-grounded first-frame mechanism that supplies a shared spatial reference across clips, and multi-stage reviewer loops that detect and revise errors in script, visuals, and video.

What carries the argument

The multi-agent debate-based story generation module working together with the 3D-grounded first-frame generation mechanism to enforce pacing and maintain consistent character and scene positions across clips.

If this is right

  • Stories acquire stronger hooks, escalation, and endings suited to short formats.
  • Character positions and scene layouts stay aligned without drifting between clips.
  • Fewer manual corrections are required because automated loops catch script and visual errors.
  • Background music and scene transitions are matched automatically to increase immersion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The consistency mechanisms could support longer video series if the 3D references scale across many clips.
  • Repeated use might allow the agents to learn user-specific preferences for more tailored dramas.
  • Adding direct user input into the reviewer loops could reduce remaining quality gaps.

Load-bearing premise

That multi-agent debate reliably enforces short-drama pacing and narrative coherence while 3D grounding keeps spatial consistency across generated clips.

What would settle it

Generate videos from the same single-sentence prompts with both the proposed system and existing pipelines, then compare their scores on narrative quality and cross-clip consistency using the Short-Drama-Bench metrics.

Figures

Figures reproduced from arXiv: 2605.22144 by Chenyu Zhang, Ming Li, Naixuan Huang, Si Yong Yeo, Tao He, Weilong Yan, Yucheng Chen, Yufei Shi.

Figure 1
Figure 1. Figure 1: From one sentence to a full short drama: we show four highlight abilities by our multi-agent pipeline in structured story synthesis, hook design, spatial consistency, and product-level quality. Abstract Existing approaches for digital short-drama production typically rely on one-shot LLM generated scripts and loosely coupled pipelines, which fail to satisfy three key requirements of short-drama generation:… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our personalized short-form drama generation pipeline four stages. Given [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Consistent first-frame synthesis via 3D scene grounding. We reconstruct a scene-level [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative examples. Top: comparison between our generated results and baselines [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The Multi-Agent Debating-based Story Generation Framework. [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Our Diverse Transition Clips and BGM Planning & Mixing [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Gallery One Of Generated Videos [PITH_FULL_IMAGE:figures/full_fig_p031_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Gallery Two Of Generated Videos 32 [PITH_FULL_IMAGE:figures/full_fig_p032_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Our BGM Datasets 33 [PITH_FULL_IMAGE:figures/full_fig_p033_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Prompt template for model-based Short-Drama-Bench evaluation. [PITH_FULL_IMAGE:figures/full_fig_p034_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Prompt template for scene-level script review. [PITH_FULL_IMAGE:figures/full_fig_p035_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Prompt template for 3D-consistent first-frame candidate selection. [PITH_FULL_IMAGE:figures/full_fig_p036_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Scoring criteria and output format for 3D-consistent first-frame candidate selection. [PITH_FULL_IMAGE:figures/full_fig_p037_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Prompt template for generated video review. [PITH_FULL_IMAGE:figures/full_fig_p038_14.png] view at source ↗
read the original abstract

Existing approaches for digital short-drama production typically rely on one-shot LLM generated scripts and loosely coupled pipelines, which fail to satisfy three key requirements of short-drama generation: (1) narrative pacing, resulting in weak hooks, insufficient escalation, and unattractive endings; (2) spatial consistency, leading to drifting scene layouts and inconsistent character positions across clips; and (3) production-level quality control, requiring extensive manual review and correction across script and visual stages. We present One Sentence, One Drama, a hierarchical multi-agent framework that transforms a user's single-sentence idea into a fully produced short drama through structured intermediate modules and iterative refinement. Our approach is built upon three key components: (1) a multi-agent debate-based story generation module that enforces short-drama pacing and narrative coherence; (2) a 3D-grounded first-frame generation mechanism that establishes a shared spatial reference for consistent character positioning and scene layout across clips; and (3) multi-stage reviewer loops that perform comprehensive error detection and targeted revision across script, visual, and video generation stages. We also introduce scene-level BGM matching and scene transition planning to improve the audience's immersive experience. To systematically evaluate this task, we introduce Short-Drama-Bench, a benchmark that extends standard video quality metrics with short-drama-specific criteria. Experimental results demonstrate that our method significantly outperforms existing pipelines in narrative quality, cross-clip consistency, and overall viewing experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents 'One Sentence, One Drama', a hierarchical multi-agent framework that converts a single user sentence into a complete short drama. Key components include a debate-based story generation module to enforce pacing and coherence, a 3D-grounded first-frame mechanism for spatial consistency across clips, multi-stage reviewer loops for error detection and revision, plus scene-level BGM matching and transition planning. The authors introduce Short-Drama-Bench for evaluation and report that their method outperforms existing pipelines on narrative quality, cross-clip consistency, and viewing experience.

Significance. If the experimental claims hold after proper validation, the work offers a structured multi-agent solution to longstanding issues in automated short-drama production such as weak narrative pacing and visual drift. The introduction of Short-Drama-Bench, which augments standard video metrics with drama-specific criteria, provides a useful resource for the community. The 3D-grounding and iterative review mechanisms represent concrete engineering advances that could be adapted to related generative video tasks.

major comments (2)
  1. [Experiments] Experiments section: The headline claim of significant outperformance rests on human ratings of narrative quality, cross-clip consistency, and viewing experience, yet the manuscript supplies no information on evaluator count, selection criteria, blinding, rating scale anchors, inter-rater reliability (e.g., Fleiss' kappa or ICC), or statistical tests (p-values, confidence intervals, multiple-comparison correction). Without these details the reported gains cannot be distinguished from rater bias or prompt sensitivity.
  2. [Ablation Studies] §4 (or equivalent ablation subsection): The causal contribution of the multi-agent debate module and the 3D-grounded first-frame mechanism to the measured improvements is not isolated; the paper describes these components but does not present controlled ablations that remove each module while holding others fixed.
minor comments (2)
  1. [Abstract] The abstract states outperformance without any numerical values or baseline names; adding a concise quantitative summary would strengthen the opening.
  2. [Method] Notation for agent roles and reviewer loops is introduced without a compact diagram or table summarizing the information flow; a single overview figure would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will incorporate the suggested improvements into the revised manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The headline claim of significant outperformance rests on human ratings of narrative quality, cross-clip consistency, and viewing experience, yet the manuscript supplies no information on evaluator count, selection criteria, blinding, rating scale anchors, inter-rater reliability (e.g., Fleiss' kappa or ICC), or statistical tests (p-values, confidence intervals, multiple-comparison correction). Without these details the reported gains cannot be distinguished from rater bias or prompt sensitivity.

    Authors: We agree that these methodological details are necessary to substantiate the human evaluation results. In the revised manuscript we will expand the Experiments section to report the evaluator count, selection criteria, blinding procedures, explicit rating scale anchors, inter-rater reliability (Fleiss' kappa), and statistical analysis including p-values, confidence intervals, and multiple-comparison corrections. These additions will be placed in a dedicated subsection on evaluation protocol. revision: yes

  2. Referee: [Ablation Studies] §4 (or equivalent ablation subsection): The causal contribution of the multi-agent debate module and the 3D-grounded first-frame mechanism to the measured improvements is not isolated; the paper describes these components but does not present controlled ablations that remove each module while holding others fixed.

    Authors: We acknowledge that the current manuscript does not contain controlled ablation experiments that isolate the individual contributions of the multi-agent debate module and the 3D-grounded first-frame mechanism. We will add a new ablation subsection that presents quantitative results for ablated variants in which each of these components is removed while all other modules remain fixed, thereby clarifying their causal impact on narrative quality and consistency metrics. revision: yes

Circularity Check

0 steps flagged

No circularity: system design and benchmark evaluation are independent

full rationale

The paper describes an engineering pipeline (multi-agent debate for story pacing, 3D-grounded first-frame for spatial consistency, reviewer loops for refinement) that transforms a one-sentence prompt into a short drama. Claims of superiority rest on comparative experiments against existing pipelines using the newly introduced Short-Drama-Bench, which extends standard metrics with drama-specific criteria. No equations, fitted parameters, or first-principles derivations appear; nothing reduces by construction to the inputs or to self-citations. The evaluation protocol, while potentially under-specified on human-rater details, is an external measurement step rather than a definitional loop. This is a standard self-contained systems paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework relies on standard assumptions from LLM multi-agent systems and 3D scene representation; no new free parameters, axioms, or invented entities are introduced beyond the described modules.

pith-pipeline@v0.9.0 · 5811 in / 1059 out tokens · 46563 ms · 2026-05-22T06:48:00.209088+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · 13 internal anchors

  1. [1]

    https://www.dramaland.com/, 2026

    Dramaland short drama creator service platform. https://www.dramaland.com/, 2026. Accessed: 2026-05-06. Public platform quotation for Hongguo short-drama production tiers: A-level 2000 CNY/min, S-level 3000 CNY/min, and S+-level 5000 CNY/min

  2. [2]

    Onestory: Coherent multi-shot video generation with adaptive memory,

    Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, and Tian Xie. Onestory: Coherent multi-shot video generation with adaptive memory,

  3. [3]

    URLhttps://arxiv.org/abs/2512.07802

  4. [4]

    Claude Opus 4.6 System Card

    Anthropic. Claude Opus 4.6 System Card. https://www-cdn.anthropic.com/ 14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf, February 2026. System card

  5. [5]

    Multi-shot character consistency for text-to- video generation.arXiv:2412.07750, 2024

    Yuval Atzmon, Rinon Gal, Yoad Tewel, Yoni Kasten, and Gal Chechik. Multi-shot character consistency for text-to-video generation.arXiv preprint arXiv:2412.07750, 2024

  6. [6]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  7. [7]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/ video-generation-models-as-world-simulators

  8. [8]

    Xiao yun que ai agent

    ByteDance. Xiao yun que ai agent. https://xyq.jianying.com, 2026. Closed-source commercial product built on Seedance 2.0. Accessed: 2026-04-22

  9. [9]

    Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity.https://seed.bytedance.com/en/seed2, 2026

    ByteDance Seed Team. Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity.https://seed.bytedance.com/en/seed2, 2026. Model card

  10. [10]

    Audience in the loop: Viewer feedback- driven content creation in micro-drama production on social media

    Gengchen Cao, Tianke He, Yixuan Liu, and RAY LC. Audience in the loop: Viewer feedback- driven content creation in micro-drama production on social media. InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems, pages 1–25, 2026

  11. [11]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

  12. [12]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model...

  13. [13]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation, 2025. URLhttps://arxiv.org/abs/2510.02283. 11

  14. [15]

    Infinitystory: Unlimited video generation with world consistency and character-aware shot transitions, 2026

    Mohamed Elmoghany, Liangbing Zhao, Xiaoqian Shen, Subhojyoti Mukherjee, Yang Zhou, Gang Wu, Viet Dac Lai, Seunghyun Yoon, Ryan Rossi, Abdullah Rashwan, Puneet Mathur, Varun Manjunatha, Daksh Dangi, Chien Nguyen, Nedim Lipka, Trung Bui, Krishna Kumar Singh, Ruiyi Zhang, Xiaolei Huang, Jaemin Cho, Yu Wang, Namyong Park, Zhengzhong Tu, Hongjie Chen, Hoda Eld...

  15. [16]

    Gemini 3 pro image preview

    Google AI for Developers. Gemini 3 pro image preview. https://ai.google.dev/ gemini-api/docs/models/gemini-3-pro-image-preview , 2026. Accessed: 2026-04- 24

  16. [17]

    Veo 3 technical report

    Google DeepMind. Veo 3 technical report. https://storage.googleapis.com/ deepmind-media/veo/Veo-3-Tech-Report.pdf, 2025. Technical report

  17. [18]

    End-to-end training for au- toregressive video diffusion via self-resampling.arXiv preprint arXiv:2512.15702,

    Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, and Dahua Lin. End-to-end training for autoregressive video diffusion via self-resampling, 2025. URLhttps://arxiv.org/abs/2512.15702

  18. [19]

    Long context tuning for video generation

    Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17281–17291, 2025

  19. [20]

    Toonflow.https://github.com/HBAI-Ltd/Toonflow-app, 2026

    HBAI Ltd. Toonflow.https://github.com/HBAI-Ltd/Toonflow-app, 2026. Open-source project under AGPL-3.0 license. Accessed: 2026-04-22

  20. [21]

    Storyagent: Customized storytelling video generation via multi-agent collaboration

    Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, and Xiaodan Liang. Storyagent: Customized storytelling video generation via multi-agent collaboration. arXiv preprint arXiv:2411.04925, 2024

  21. [22]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

  22. [23]

    Vbench: Comprehensive benchmark suite for video generative models, 2023

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023. URLhttps://arxiv.org/abs/2311.17982

  23. [24]

    Hunyuanvideo: A systematic framework for large video generative models,

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

  24. [25]

    URLhttps://arxiv.org/abs/2412.03603

  25. [26]

    Kling ai.https://klingai.com, 2024

    Kuaishou Technology. Kling ai.https://klingai.com, 2024. Accessed: 2026-04-22

  26. [27]

    Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    Haodong Li, Shaoteng Liu, Zhe Lin, and Manmohan Chandraker. Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion, 2026. URL https://arxiv.org/abs/2602.07775. 12

  27. [28]

    Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091,

    Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091, 2023

  28. [29]

    Videostudio: Generating consistent-content and multi-scene videos

    Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. Videostudio: Generating consistent-content and multi-scene videos. InEuropean Conference on Computer Vision, pages 468–485. Springer, 2024

  29. [30]

    Holocine: Holistic generation of cinematic multi-shot long video narratives, 2025

    Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, and Huamin Qu. Holocine: Holistic generation of cinematic multi-shot long video narratives, 2025. URL https://arxiv.org/ abs/2510.20822

  30. [31]

    The script is all you need: An agentic framework for long-horizon dialogue-to- cinematic video generation, 2026

    Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, and Linus. The script is all you need: An agentic framework for long-horizon dialogue-to- cinematic video generation, 2026. URLhttps://arxiv.org/abs/2601.17737

  31. [32]

    GPT-Audio API Documentation, 2026

    OpenAI. GPT-Audio API Documentation, 2026. URL https://platform.openai.com/ docs/models/gpt-audio. Accessed: 2026-04-30

  32. [33]

    Intensifying competition in the short-drama market poses challenges for long-form video platforms

    Shiya Pi. Intensifying competition in the short-drama market poses challenges for long-form video platforms. Sina Finance, March 2025. URL https://finance.sina.com.cn/roll/ 2025-03-06/doc-inensrzt1029804.shtml . Accessed: 2026-05-06. The article reports that common short-drama production costs are about 10,000 CNY per minute

  33. [34]

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hon...

  34. [35]

    Pvchat: Personalized video chat with one-shot learning

    Yufei Shi, Weilong Yan, Gang Xu, Yumeng Li, Yucheng Chen, Zhenxi Li, Fei Yu, Ming Li, and Si Yong Yeo. Pvchat: Personalized video chat with one-shot learning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 23321–23331, October 2025

  35. [36]

    Skyreels v1: Human-centric video foundation model

    SkyReels-AI. Skyreels v1: Human-centric video foundation model. https://github.com/ SkyworkAI/SkyReels-V1, 2025

  36. [37]

    Training diffusion language models for black-box optimization.arXiv preprint arXiv:2603.17919, 2026

    Zipeng Sun, Can Chen, Ye Yuan, Haolun Wu, Jiayao Gu, Christopher Pal, and Xue Liu. Training diffusion language models for black-box optimization.arXiv preprint arXiv:2603.17919, 2026. 13

  37. [38]

    Qwen3.5-omni technical report, 2026

    Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604. 15804

  38. [39]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  39. [40]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  40. [41]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025

  41. [42]

    Marble: A multimodal world model

    World Labs. Marble: A multimodal world model. https://www.worldlabs.ai/blog/ marble-world-model, 2026. Accessed: 2026-04-24

  42. [43]

    Automated movie generation via multi-agent cot planning.arXiv preprint arXiv:2503.07314, 2025

    Weijia Wu, Zeyu Zhu, and Mike Zheng Shou. Automated movie generation via multi-agent cot planning.arXiv preprint arXiv:2503.07314, 2025

  43. [44]

    Scieducator: Scientific video understanding and educating via deming-cycle multi-agent system.arXiv preprint arXiv:2511.17943, 2025

    Zhiyu Xu, Weilong Yan, Yufei Shi, Xin Meng, Tao He, Huiping Zhuang, Ming Li, and Hehe Fan. Scieducator: Scientific video understanding and educating via deming-cycle multi-agent system.arXiv preprint arXiv:2511.17943, 2025

  44. [45]

    Tan, Bing Zeng, and Shuaicheng Liu

    Weilong Yan, Robby T. Tan, Bing Zeng, and Shuaicheng Liu. Deep homography mixture for single image rolling shutter correction. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9868–9877, October 2023

  45. [46]

    Weilong Yan, Ming Li, Haipeng Li, Shuwei Shao, and Robby T. Tan. Synthetic-to-real self- supervised robust depth estimation via learning with motion and structure priors. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 21880–21890, June 2025

  46. [47]

    InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    Weilong Yan, Haipeng Li, Hao Xu, Nianjin Ye, Yihao Ai, Shuaicheng Liu, and Jingyu Hu. LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency.arXiv preprint arXiv:2602.18735, 2026

  47. [48]

    LongLive: Real-time Interactive Long Video Generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

  48. [49]

    Sam 3d body: Robust full-body human mesh recovery, 2026

    Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, and Kris Kitani. Sam 3d body: Robust full-body human mesh recovery, 2026. URL https://arxiv.org/abs/2602.15989

  49. [50]

    Support-Proximity Augmented Diffusion Estimation for Offline Black-Box Optimization

    Yonghan Yang, Ye Yuan, Zipeng Sun, Linfeng Du, Bowei He, Haolun Wu, Can Chen, and Xue Liu. Support-proximity augmented diffusion estimation for offline black-box optimization. arXiv preprint arXiv:2605.11246, 2026

  50. [51]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer, 2025. URLhttps://arxiv.org/abs/2408.06072. 14

  51. [52]

    Freeman, Frédo Durand, Eli Shechtman, and Xun Huang

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Frédo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  52. [53]

    Storymem: Multi-shot long video storytelling with memory.arXiv preprint arXiv:2512.19539, 2025

    Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, and Xingang Pan. Storymem: Multi-shot long video storytelling with memory.arXiv preprint arXiv:2512.19539, 2025

  53. [54]

    4dpc2hat: Towards dynamic point cloud understanding with failure-aware bootstrapping.arXiv preprint arXiv:2602.03890, 2026

    Xindan Zhang, Weilong Yan, Yufei Shi, Xuerui Qiu, Tao He, Ying Li, Ming Li, and Hehe Fan. 4dpc2hat: Towards dynamic point cloud understanding with failure-aware bootstrapping.arXiv preprint arXiv:2602.03890, 2026

  54. [55]

    Relax forcing: Relaxed kv-memory for consistent long video generation, 2026

    Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, and Ioannis Patras. Relax forcing: Relaxed kv-memory for consistent long video generation, 2026. URL https:// arxiv.org/abs/2603.21366

  55. [56]

    Videogen-of-thought: Step-by-step generating multi-shot video with minimal manual intervention, 2025

    Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, and Ser-Nam Lim. Videogen-of-thought: Step-by-step generating multi-shot video with minimal manual intervention, 2025. URL https://arxiv.org/abs/2412.02259

  56. [57]

    Storydiffusion: Consistent self-attention for long-range image and video generation.Advances in Neural Information Processing Systems, 37:110315–110340, 2024

    Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self-attention for long-range image and video generation.Advances in Neural Information Processing Systems, 37:110315–110340, 2024

  57. [58]

    Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

    Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation, 2026. URLhttps://arxiv.org/abs/2602.02214

  58. [59]

    Vistorybench: Comprehensive benchmark suite for story visualization,

    Cailin Zhuang, Ailin Huang, Yaoqi Hu, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Zhewei Huang, Gang Yu, and Chi Zhang. Vistorybench: Comprehensive benchmark suite for story visualization,

  59. [60]

    URLhttps://arxiv.org/abs/2505.24862. 15 Appendix Overview This appendix provides additional details for the related work, generation pipeline, benchmark construction, evaluation protocol, implementation settings, prompts, and responsible-use discussion. • Appendix A: Broader Impacts.Potential positive impacts on creative access and production cost, as wel...

  60. [61]

    On the visual side, keyframes are generated independently from a few reference images, causing spatial drift and inconsistent character placement across clips

    appears to rely on one-shot LLM expansion, leading to weak hooks and brittle narrative logic. On the visual side, keyframes are generated independently from a few reference images, causing spatial drift and inconsistent character placement across clips. They also depend on manual inspection for quality control, and neither model scene-level audio or trans...

  61. [62]

    After Being Abandoned at the Wedding, She Returned as an Investor Capable of Buying the Groom’s Empire 25 Category Subcategory Specific Topics

  62. [63]

    The Day the Divorce Papers Were Signed, His Ex-Wife’s Company Went Public

  63. [64]

    The Humiliated Stable Boy Turns Out to Be the Long-Lost Heir to the Kingdom

    The Daughter-in-Law Kicked Out by Her Mother-in-Law Became the New Owner of Her Company Three Years Later Hidden Identity 1. The Humiliated Stable Boy Turns Out to Be the Long-Lost Heir to the Kingdom

  64. [65]

    The Security Guard Everyone in the Company Looks Down On Has Five World Leaders’ Private Numbers in His Phone

  65. [66]

    The designer’s wife, who was accused of steal- ing her boss’s manuscript, was confronted by her husband who had been secretly married

    The Transfer Student Mocked by Classmates Whose Father Is Their School’s Chairman of the Board Career Comeback 1. The designer’s wife, who was accused of steal- ing her boss’s manuscript, was confronted by her husband who had been secretly married. With a single phone call, the CEO was summoned

  66. [67]

    The Intern Publicly Humiliated by the Director Was Sitting in the Director’s Chair a Year Later

  67. [68]

    The Woman Fired for Being Pregnant Returned as the Company’s Biggest Client

    The Designer Reduced to Tears by a Client Went On to Win an International Design Award Social Realism Workplace Injustice 1. The Woman Fired for Being Pregnant Returned as the Company’s Biggest Client

  68. [69]

    The Middle Manager ‘Optimized Out’ at 35 Built a Startup Team from an Unemployment Group Chat

  69. [70]

    The Night the Hospital Refused Her Surgery, She Livestreamed Everything

    The Engineer Forced to Sign a Non-Compete Discovered the Boss Had Already Violated His Own Medical & Survival 1. The Night the Hospital Refused Her Surgery, She Livestreamed Everything

  70. [71]

    Her Father’s Life-Saving Pill Costs 700 Yuan Each, So the Daughter Went to India to Find the Manufacturer Herself

  71. [72]

    When the Mother Who Favored Sons Over Daughters Fell Ill, Only the Neglected Daughter Came

    In the Three Months She Was Misdiagnosed with Cancer, She Saw Everyone Around Her for Who They Really Are Family Ethics 1. When the Mother Who Favored Sons Over Daughters Fell Ill, Only the Neglected Daughter Came

  72. [73]

    The Parents Gave the House to Their Son but Left the Debt to Their Daughter

  73. [74]

    The Whole Family Pooled Money for the Brother to Study Abroad, but the Sister Got Into a Better School on Her Own Ancient Court In- trigue Harem Power Strug- gle

  74. [75]

    The Abandoned Consort in the Cold Palace Is Determined to Put the Crown Prince on the Throne

  75. [76]

    Sentenced to Death on Her First Day in the Palace, She Traded a Bowl of Poison for the Em- press’s Secret

  76. [77]

    The Poisoned Princess Married the Enemy Prince Only to Burn the Empire from Within

    She Pretended to Be Out of Favor for Three Years While Secretly Building a Shadow Guard That Answers Only to Her 26 Category Subcategory Specific Topics Court Conspiracy 1. The Poisoned Princess Married the Enemy Prince Only to Burn the Empire from Within

  77. [78]

    Everyone Believed the Chancellor Was Loyal — Only the Crown Prince Knew He Killed the Late Emperor

  78. [79]

    The Exiled General’s Daughter Returns with Her Father’s Former Army Women Breaking the Rules

  79. [80]

    She Disguised Herself as a Man to Top the Imperial Exam, Only to Be Exposed in the Golden Hall

  80. [81]

    The Princess Who Knew No Martial Arts Talked Down a Hundred Thousand Rebels with Words Alone

Showing first 80 references.