One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems

Chenyu Zhang; Ming Li; Naixuan Huang; Si Yong Yeo; Tao He; Weilong Yan; Yucheng Chen; Yufei Shi

A multi-agent system turns one user sentence into a paced and spatially consistent short drama video.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-22 06:48 UTC pith:JOGDCZKZ

Load-bearing objection: This paper puts together a multi-agent pipeline for one-sentence short dramas with targeted fixes for pacing and consistency, plus a new benchmark, but the human ratings lack basic reporting that would make the gains convincing.

arxiv 2605.22144 v1 pith:JOGDCZKZ submitted 2026-05-21 cs.CV

Yufei Shi, Weilong Yan, Naixuan Huang, Yucheng Chen, Chenyu Zhang, Tao He, Si Yong Yeo, Ming Li

classification cs.CV

keywords short drama generation · multi-agent systems · video consistency · narrative pacing · spatial consistency · personalized video

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that current one-shot LLM scripts and loose production pipelines cannot deliver the narrative pacing, spatial consistency, and quality control needed for short dramas. Weak hooks, drifting character positions, and endless manual fixes result from this gap. Their hierarchical multi-agent approach uses debate among agents to shape the story, 3D-grounded first frames to anchor scene layouts, and repeated reviewer loops to catch errors at every stage. If successful, this would let anyone create coherent personalized short videos directly from an idea without constant human oversight.

Core claim

The authors present a hierarchical multi-agent framework that converts a single-sentence idea into a fully produced short drama through three components: debate-based story generation that enforces pacing and coherence, a 3D-grounded first-frame mechanism that supplies a shared spatial reference across clips, and multi-stage reviewer loops that detect and revise errors in script, visuals, and video.
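The reviewer-loop idea in the third component can be illustrated with a small sketch. This is not the authors' implementation: `review_loop`, `toy_review`, `toy_revise`, and the `max_rounds` budget are hypothetical names chosen for illustration, and in a real system the reviewer and reviser would presumably be model calls rather than string checks.

```python
from typing import Callable, List, Tuple


def review_loop(
    draft: str,
    review: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Alternate review and revision until no issues remain or the budget is spent.

    Returns the final draft and the number of revisions applied.
    """
    for revisions in range(max_rounds):
        issues = review(draft)
        if not issues:
            return draft, revisions
        draft = revise(draft, issues)
    # Budget exhausted: the last revision is returned without a further review.
    return draft, max_rounds


# Toy stand-ins (hypothetical): a real reviewer and reviser would be model calls.
def toy_review(script: str) -> List[str]:
    issues = []
    if not script.startswith("HOOK"):
        issues.append("missing hook")
    if not script.endswith("ENDING"):
        issues.append("missing ending")
    return issues


def toy_revise(script: str, issues: List[str]) -> str:
    # Address one issue per round so the loop visibly iterates.
    if issues[0] == "missing hook":
        return "HOOK " + script
    return script + " ENDING"


final_script, revisions = review_loop("middle", toy_review, toy_revise)
print(final_script, revisions)
```

The explicit round budget is a design choice of this sketch: it bounds cost when a reviewer never signs off, at the price of possibly returning a draft that was not re-checked.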

What carries the argument

The multi-agent debate-based story generation module and the 3D-grounded first-frame generation mechanism, working together: the former enforces pacing, and the latter keeps character and scene positions consistent across clips.

Load-bearing premise

That multi-agent debate reliably enforces short-drama pacing and narrative coherence while 3D grounding keeps spatial consistency across generated clips.

What would settle it

Generate videos from the same single-sentence prompts with both the proposed system and existing pipelines, then compare their scores on narrative quality and cross-clip consistency using the Short-Drama-Bench metrics.
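One way such a same-prompt comparison could be analysed is an exact paired sign-flip permutation test on per-prompt score differences. The sketch below is an assumption about how the comparison might be run, not a procedure from the paper; the function name and the example scores are invented for illustration, and the exhaustive enumeration is only practical for a small number of prompts.

```python
from itertools import product
from typing import Sequence, Tuple


def paired_sign_flip_test(
    ours: Sequence[float], baseline: Sequence[float]
) -> Tuple[float, float]:
    """Exact two-sided sign-flip test on paired scores.

    Returns (mean difference, p-value). Under the null hypothesis that the two
    systems are interchangeable, each per-prompt difference is equally likely
    to have either sign, so all 2**n sign patterns are enumerated.
    """
    if len(ours) != len(baseline) or not ours:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(ours, baseline)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return sum(diffs) / len(diffs), extreme / total


# Hypothetical per-prompt scores on a 1-5 scale (made up for illustration).
ours_scores = [4, 5, 4, 5, 4, 5]
baseline_scores = [3, 3, 3, 4, 3, 3]
mean_diff, p_value = paired_sign_flip_test(ours_scores, baseline_scores)
print(f"mean difference = {mean_diff:.3f}, p = {p_value:.5f}")
```

Pairing by prompt matters here because prompt difficulty varies; comparing unpaired averages would mix that variation into the system effect.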

If this is right

Stories acquire stronger hooks, escalation, and endings suited to short formats.
Character positions and scene layouts stay aligned without drifting between clips.
Fewer manual corrections are required because automated loops catch script and visual errors.
Background music and scene transitions are matched automatically to increase immersion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The consistency mechanisms could support longer video series if the 3D references scale across many clips.
Repeated use might allow the agents to learn user-specific preferences for more tailored dramas.
Adding direct user input into the reviewer loops could reduce remaining quality gaps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

This paper puts together a multi-agent pipeline for one-sentence short dramas with targeted fixes for pacing and consistency, plus a new benchmark, but the human ratings lack basic reporting that would make the gains convincing.

The key takeaway is that this paper builds a hierarchical multi-agent framework for short drama generation from one sentence, with components for story debate, 3D consistency, and iterative reviews, plus a new benchmark. It targets practical problems in automated video production, but the supporting experiments lack crucial details on the human evaluations.

What the paper does well is spell out the shortcomings of simple LLM script generation for short dramas and propose targeted fixes. The multi-agent debate helps with narrative pacing and coherence, while the 3D-grounded first frame aims to fix spatial drifting. Adding reviewer loops at multiple stages is a solid way to catch errors before final output. The Short-Drama-Bench seems like a reasonable extension for evaluating drama-specific qualities.

The soft spots are in the results section. Claims of outperforming existing pipelines on quality and consistency come from human ratings, yet the paper skips basics like how many evaluators were involved, their agreement level, or any statistical analysis. Without that, it's hard to rule out rater variability or other factors as the real driver. The causal role of each module isn't clearly isolated either.

This paper is for people working on generative AI for media and entertainment, particularly those dealing with short-form content. A reader who wants ideas on applying multi-agent systems to creative pipelines could pick up some useful patterns here. It deserves a serious referee because the task is timely and the approach is a fresh combination, even with the evaluation gaps. I would recommend putting it through peer review.

Referee Report

2 major / 2 minor

Summary. The paper presents 'One Sentence, One Drama', a hierarchical multi-agent framework that converts a single user sentence into a complete short drama. Key components include a debate-based story generation module to enforce pacing and coherence, a 3D-grounded first-frame mechanism for spatial consistency across clips, multi-stage reviewer loops for error detection and revision, plus scene-level BGM matching and transition planning. The authors introduce Short-Drama-Bench for evaluation and report that their method outperforms existing pipelines on narrative quality, cross-clip consistency, and viewing experience.

Significance. If the experimental claims hold after proper validation, the work offers a structured multi-agent solution to longstanding issues in automated short-drama production such as weak narrative pacing and visual drift. The introduction of Short-Drama-Bench, which augments standard video metrics with drama-specific criteria, provides a useful resource for the community. The 3D-grounding and iterative review mechanisms represent concrete engineering advances that could be adapted to related generative video tasks.

major comments (2)

[Experiments] Experiments section: The headline claim of significant outperformance rests on human ratings of narrative quality, cross-clip consistency, and viewing experience, yet the manuscript supplies no information on evaluator count, selection criteria, blinding, rating scale anchors, inter-rater reliability (e.g., Fleiss' kappa or ICC), or statistical tests (p-values, confidence intervals, multiple-comparison correction). Without these details the reported gains cannot be distinguished from rater bias or prompt sensitivity.
[Ablation Studies] §4 (or equivalent ablation subsection): The causal contribution of the multi-agent debate module and the 3D-grounded first-frame mechanism to the measured improvements is not isolated; the paper describes these components but does not present controlled ablations that remove each module while holding others fixed.
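For readers unfamiliar with the inter-rater statistic named in the first major comment, the following is a minimal sketch of Fleiss' kappa computed from a count table. The rating tables are invented for illustration and say nothing about the paper's actual data. Note that Fleiss' kappa treats categories as nominal, so for ordinal rating scales an ICC or a weighted statistic may be the better fit.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same number of
    raters (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item must have the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Overall share of ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    # Undefined (division by zero) if all ratings fall into a single category.
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical tables: 4 items, 3 raters, 2 categories.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
print(fleiss_kappa(perfect), fleiss_kappa(split))
```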

minor comments (2)

[Abstract] The abstract states outperformance without any numerical values or baseline names; adding a concise quantitative summary would strengthen the opening.
[Method] Notation for agent roles and reviewer loops is introduced without a compact diagram or table summarizing the information flow; a single overview figure would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will incorporate the suggested improvements into the revised manuscript.

Referee: [Experiments] Experiments section: The headline claim of significant outperformance rests on human ratings of narrative quality, cross-clip consistency, and viewing experience, yet the manuscript supplies no information on evaluator count, selection criteria, blinding, rating scale anchors, inter-rater reliability (e.g., Fleiss' kappa or ICC), or statistical tests (p-values, confidence intervals, multiple-comparison correction). Without these details the reported gains cannot be distinguished from rater bias or prompt sensitivity.

Authors: We agree that these methodological details are necessary to substantiate the human evaluation results. In the revised manuscript we will expand the Experiments section to report the evaluator count, selection criteria, blinding procedures, explicit rating scale anchors, inter-rater reliability (Fleiss' kappa), and statistical analysis including p-values, confidence intervals, and multiple-comparison corrections. These additions will be placed in a dedicated subsection on evaluation protocol.

revision: yes

Referee: [Ablation Studies] §4 (or equivalent ablation subsection): The causal contribution of the multi-agent debate module and the 3D-grounded first-frame mechanism to the measured improvements is not isolated; the paper describes these components but does not present controlled ablations that remove each module while holding others fixed.

Authors: We acknowledge that the current manuscript does not contain controlled ablation experiments that isolate the individual contributions of the multi-agent debate module and the 3D-grounded first-frame mechanism. We will add a new ablation subsection that presents quantitative results for ablated variants in which each of these components is removed while all other modules remain fixed, thereby clarifying their causal impact on narrative quality and consistency metrics.

revision: yes
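The ablation design discussed in this exchange (remove one module, hold the rest fixed) can be sketched as a leave-one-out enumeration of configurations. The flag names below are hypothetical labels for the three components described in the abstract, not identifiers from the authors' code.

```python
from dataclasses import dataclass, fields, replace
from typing import Dict


@dataclass(frozen=True)
class PipelineConfig:
    # Hypothetical switches, one per component named in the abstract.
    story_debate: bool = True
    grounded_first_frame: bool = True
    reviewer_loops: bool = True


def leave_one_out(full: PipelineConfig) -> Dict[str, PipelineConfig]:
    """Return the full config plus one variant per enabled component,
    each with exactly that component switched off."""
    variants = {"full": full}
    for field in fields(full):
        if getattr(full, field.name):
            variants[f"no_{field.name}"] = replace(full, **{field.name: False})
    return variants


variants = leave_one_out(PipelineConfig())
for name, config in variants.items():
    print(name, config)
```

Each variant would then be run on the same prompts and scored with the same metrics, so any change can be attributed to the single removed component.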

Circularity Check

0 steps flagged

No circularity: system design and benchmark evaluation are independent

The paper describes an engineering pipeline (multi-agent debate for story pacing, 3D-grounded first-frame for spatial consistency, reviewer loops for refinement) that transforms a one-sentence prompt into a short drama. Claims of superiority rest on comparative experiments against existing pipelines using the newly introduced Short-Drama-Bench, which extends standard metrics with drama-specific criteria. No equations, fitted parameters, or first-principles derivations appear; nothing reduces by construction to the inputs or to self-citations. The evaluation protocol, while potentially under-specified on human-rater details, is an external measurement step rather than a definitional loop. This is a standard self-contained systems paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework relies on standard assumptions from LLM multi-agent systems and 3D scene representation; no new free parameters, axioms, or invented entities are introduced beyond the described modules.

pith-pipeline@v0.9.0 · 5811 in / 1059 out tokens · 46563 ms · 2026-05-22T06:48:00.209088+00:00 · methodology

Original abstract

Existing approaches for digital short-drama production typically rely on one-shot LLM generated scripts and loosely coupled pipelines, which fail to satisfy three key requirements of short-drama generation: (1) narrative pacing, resulting in weak hooks, insufficient escalation, and unattractive endings; (2) spatial consistency, leading to drifting scene layouts and inconsistent character positions across clips; and (3) production-level quality control, requiring extensive manual review and correction across script and visual stages. We present One Sentence, One Drama, a hierarchical multi-agent framework that transforms a user's single-sentence idea into a fully produced short drama through structured intermediate modules and iterative refinement. Our approach is built upon three key components: (1) a multi-agent debate-based story generation module that enforces short-drama pacing and narrative coherence; (2) a 3D-grounded first-frame generation mechanism that establishes a shared spatial reference for consistent character positioning and scene layout across clips; and (3) multi-stage reviewer loops that perform comprehensive error detection and targeted revision across script, visual, and video generation stages. We also introduce scene-level BGM matching and scene transition planning to improve the audience's immersive experience. To systematically evaluate this task, we introduce Short-Drama-Bench, a benchmark that extends standard video quality metrics with short-drama-specific criteria. Experimental results demonstrate that our method significantly outperforms existing pipelines in narrative quality, cross-clip consistency, and overall viewing experience.

Figures

Figures reproduced from arXiv: 2605.22144 by Chenyu Zhang, Ming Li, Naixuan Huang, Si Yong Yeo, Tao He, Weilong Yan, Yucheng Chen, Yufei Shi.

**Figure 1.** From one sentence to a full short drama: we show four highlight abilities by our multi-agent pipeline in structured story synthesis, hook design, spatial consistency, and product-level quality.

**Figure 2.** Overview of our personalized short-form drama generation pipeline four stages. Given …

**Figure 3.** Consistent first-frame synthesis via 3D scene grounding. We reconstruct a scene-level …

**Figure 4.** Qualitative examples. Top: comparison between our generated results and baselines …

**Figure 5.** The Multi-Agent Debating-based Story Generation Framework.

**Figure 6.** Our Diverse Transition Clips and BGM Planning & Mixing

**Figure 7.** Gallery One Of Generated Videos

**Figure 8.** Gallery Two Of Generated Videos

**Figure 9.** Our BGM Datasets

**Figure 10.** Prompt template for model-based Short-Drama-Bench evaluation.

**Figure 11.** Prompt template for scene-level script review.

**Figure 12.** Prompt template for 3D-consistent first-frame candidate selection.

**Figure 13.** Scoring criteria and output format for 3D-consistent first-frame candidate selection.

**Figure 14.** Prompt template for generated video review.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: a hierarchical multi-agent framework that transforms a user's single-sentence idea into a fully produced short drama through structured intermediate modules and iterative refinement... multi-agent debate-based story generation module... 3D-grounded first-frame generation mechanism

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: 3D-grounded first-frame generation... reconstruct a scene-level 3D world W using Marble... register generated clips and the human mesh into the shared coordinate system

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Short-Drama-Bench... narrative quality, cross-clip consistency, and overall viewing experience

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FilmWorld: Agentic Novel-to-Film Generation through Dynamic Cinematic World Modeling
cs.CV 2026-07 conditional novelty 7.0

FilmWorld generates multi-scene films from novels by materializing an explicit evolving world-state trajectory and rendering shots in parallel, beating five agents on its own FilmEval benchmark.

Training Skills Like Parameters via Self-Supervised Semantic Diffusion
cs.CL 2026-07 conditional novelty 6.0

Screenwriting skills can be learned from human scripts alone by compressing and reconstructing them, storing the lessons as text cards that improve generation without changing model weights.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 2 Pith papers · 17 internal anchors

[1]

https://www.dramaland.com/, 2026

Dramaland short drama creator service platform. https://www.dramaland.com/, 2026. Accessed: 2026-05-06. Public platform quotation for Hongguo short-drama production tiers: A-level 2000 CNY/min, S-level 3000 CNY/min, and S+-level 5000 CNY/min

work page 2026
[2]

Onestory: Coherent multi-shot video generation with adaptive memory,

Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, and Tian Xie. Onestory: Coherent multi-shot video generation with adaptive memory,

work page
[3]

URLhttps://arxiv.org/abs/2512.07802

work page arXiv
[4]

Claude Opus 4.6 System Card

Anthropic. Claude Opus 4.6 System Card. https://www-cdn.anthropic.com/ 14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf, February 2026. System card

work page 2026
[5]

arXiv preprint arXiv:2412.07750 (2024)

Yuval Atzmon, Rinon Gal, Yoad Tewel, Yoni Kasten, and Gal Chechik. Multi-shot character consistency for text-to-video generation.arXiv preprint arXiv:2412.07750, 2024

work page arXiv 2024
[6]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/ video-generation-models-as-world-simulators

work page 2024
[8]

Xiao yun que ai agent

ByteDance. Xiao yun que ai agent. https://xyq.jianying.com, 2026. Closed-source commercial product built on Seedance 2.0. Accessed: 2026-04-22

work page 2026
[9]

Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity.https://seed.bytedance.com/en/seed2, 2026

ByteDance Seed Team. Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity.https://seed.bytedance.com/en/seed2, 2026. Model card

work page 2026
[10]

Audience in the loop: Viewer feedback- driven content creation in micro-drama production on social media

Gengchen Cao, Tianke He, Yixuan Liu, and RAY LC. Audience in the loop: Viewer feedback- driven content creation in micro-drama production on social media. InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems, pages 1–25, 2026

work page 2026
[11]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

SkyReels-V2: Infinite-length Film Generative Model

Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation, 2025. URLhttps://arxiv.org/abs/2510.02283. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Infinitystory: Unlimited video generation with world consistency and character-aware shot transitions.arXiv preprint arXiv:2603.03646, 2025

Mohamed Elmoghany, Liangbing Zhao, Xiaoqian Shen, Subhojyoti Mukherjee, Yang Zhou, Gang Wu, Viet Dac Lai, Seunghyun Yoon, Ryan Rossi, Abdullah Rashwan, Puneet Mathur, Varun Manjunatha, Daksh Dangi, Chien Nguyen, Nedim Lipka, Trung Bui, Krishna Kumar Singh, Ruiyi Zhang, Xiaolei Huang, Jaemin Cho, Yu Wang, Namyong Park, Zhengzhong Tu, Hongjie Chen, Hoda Eld...

work page arXiv 2026
[16]

Gemini 3 pro image preview

Google AI for Developers. Gemini 3 pro image preview. https://ai.google.dev/ gemini-api/docs/models/gemini-3-pro-image-preview , 2026. Accessed: 2026-04- 24

work page 2026
[17]

Veo 3 technical report

Google DeepMind. Veo 3 technical report. https://storage.googleapis.com/ deepmind-media/veo/Veo-3-Tech-Report.pdf, 2025. Technical report

work page 2025
[18]

End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, and Dahua Lin. End-to-end training for autoregressive video diffusion via self-resampling, 2025. URLhttps://arxiv.org/abs/2512.15702

work page internal anchor Pith review arXiv 2025
[19]

Long context tuning for video generation

Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17281–17291, 2025

work page 2025
[20]

Toonflow.https://github.com/HBAI-Ltd/Toonflow-app, 2026

HBAI Ltd. Toonflow.https://github.com/HBAI-Ltd/Toonflow-app, 2026. Open-source project under AGPL-3.0 license. Accessed: 2026-04-22

work page 2026
[21]

StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration

Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, and Xiaodan Liang. Storyagent: Customized storytelling video generation via multi-agent collaboration. arXiv preprint arXiv:2411.04925, 2024

work page Pith review arXiv 2024
[22]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023. URLhttps://arxiv.org/abs/2311.17982

work page arXiv 2023
[24]

Hunyuanvideo: A systematic framework for large video generative models,

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

work page
[25]

URLhttps://arxiv.org/abs/2412.03603

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Kling ai.https://klingai.com, 2024

Kuaishou Technology. Kling ai.https://klingai.com, 2024. Accessed: 2026-04-22

work page 2024
[27]

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

Haodong Li, Shaoteng Liu, Zhe Lin, and Manmohan Chandraker. Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion, 2026. URL https://arxiv.org/abs/2602.07775. 12

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

arXiv preprint arXiv:2309.15091 , year =

Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091, 2023

work page arXiv 2023
[29]

Videostudio: Generating consistent-content and multi-scene videos

Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. Videostudio: Generating consistent-content and multi-scene videos. InEuropean Conference on Computer Vision, pages 468–485. Springer, 2024

work page 2024
[30]

Holocine: Holistic generation of cinematic multi-shot long video narratives.arXiv preprint arXiv:2510.20822, 2025

Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, and Huamin Qu. Holocine: Holistic generation of cinematic multi-shot long video narratives, 2025. URL https://arxiv.org/ abs/2510.20822

work page arXiv 2025
[31]

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, and Linus. The script is all you need: An agentic framework for long-horizon dialogue-to- cinematic video generation, 2026. URLhttps://arxiv.org/abs/2601.17737

work page internal anchor Pith review arXiv 2026
[32]

GPT-Audio API Documentation, 2026

OpenAI. GPT-Audio API Documentation, 2026. URL https://platform.openai.com/ docs/models/gpt-audio. Accessed: 2026-04-30

work page 2026
[33]

Intensifying competition in the short-drama market poses challenges for long-form video platforms

Shiya Pi. Intensifying competition in the short-drama market poses challenges for long-form video platforms. Sina Finance, March 2025. URL https://finance.sina.com.cn/roll/ 2025-03-06/doc-inensrzt1029804.shtml . Accessed: 2026-05-06. The article reports that common short-drama production costs are about 10,000 CNY per minute

work page 2025
[34]

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hon...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

Pvchat: Personalized video chat with one-shot learning

Yufei Shi, Weilong Yan, Gang Xu, Yumeng Li, Yucheng Chen, Zhenxi Li, Fei Yu, Ming Li, and Si Yong Yeo. Pvchat: Personalized video chat with one-shot learning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 23321–23331, October 2025

work page 2025
[36]

Skyreels v1: Human-centric video foundation model

SkyReels-AI. Skyreels v1: Human-centric video foundation model. https://github.com/ SkyworkAI/SkyReels-V1, 2025

work page 2025
[37]

Training diffusion language models for black-box optimization.arXiv preprint arXiv:2603.17919, 2026

Zipeng Sun, Can Chen, Ye Yuan, Haolun Wu, Jiayao Gu, Christopher Pal, and Xue Liu. Training diffusion language models for black-box optimization.arXiv preprint arXiv:2603.17919, 2026. 13

work page internal anchor Pith review arXiv 2026
[38]

Qwen3.5-omni technical report, 2026

Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604. 15804

work page 2026
[39]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

work page 2025
[41]

Continuous 3d perception model with persistent state

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025

work page 2025
[42]

Marble: A multimodal world model

World Labs. Marble: A multimodal world model. https://www.worldlabs.ai/blog/ marble-world-model, 2026. Accessed: 2026-04-24

work page 2026
[43] Weijia Wu, Zeyu Zhu, and Mike Zheng Shou. Automated movie generation via multi-agent cot planning. arXiv preprint arXiv:2503.07314, 2025.
[44] Zhiyu Xu, Weilong Yan, Yufei Shi, Xin Meng, Tao He, Huiping Zhuang, Ming Li, and Hehe Fan. Scieducator: Scientific video understanding and educating via deming-cycle multi-agent system. arXiv preprint arXiv:2511.17943, 2025.
[45] Weilong Yan, Robby T. Tan, Bing Zeng, and Shuaicheng Liu. Deep homography mixture for single image rolling shutter correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9868–9877, October 2023.
[46] Weilong Yan, Ming Li, Haipeng Li, Shuwei Shao, and Robby T. Tan. Synthetic-to-real self-supervised robust depth estimation via learning with motion and structure priors. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 21880–21890, June 2025.
[47] Weilong Yan, Haipeng Li, Hao Xu, Nianjin Ye, Yihao Ai, Shuaicheng Liu, and Jingyu Hu. LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency. arXiv preprint arXiv:2602.18735, 2026.
[48] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025.
[49] Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, and Kris Kitani. Sam 3d body: Robust full-body human mesh recovery, 2026. URL https://arxiv.org/abs/2602.15989.
[50] Yonghan Yang, Ye Yuan, Zipeng Sun, Linfeng Du, Bowei He, Haolun Wu, Can Chen, and Xue Liu. Support-proximity augmented diffusion estimation for offline black-box optimization. arXiv preprint arXiv:2605.11246, 2026.
[51] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer, 2025. URL https://arxiv.org/abs/2408.06072.
[52] Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Frédo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[53] Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, and Xingang Pan. Storymem: Multi-shot long video storytelling with memory. arXiv preprint arXiv:2512.19539, 2025.
[54] Xindan Zhang, Weilong Yan, Yufei Shi, Xuerui Qiu, Tao He, Ying Li, Ming Li, and Hehe Fan. 4dpc2hat: Towards dynamic point cloud understanding with failure-aware bootstrapping. arXiv preprint arXiv:2602.03890, 2026.
[55] Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, and Ioannis Patras. Relax forcing: Relaxed kv-memory for consistent long video generation, 2026. URL https://arxiv.org/abs/2603.21366.
[56] Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, and Ser-Nam Lim. Videogen-of-thought: Step-by-step generating multi-shot video with minimal manual intervention, 2025. URL https://arxiv.org/abs/2412.02259.
[57] Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self-attention for long-range image and video generation. Advances in Neural Information Processing Systems, 37:110315–110340, 2024.
[58] Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation, 2026. URL https://arxiv.org/abs/2602.02214.
[59] Cailin Zhuang, Ailin Huang, Yaoqi Hu, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Zhewei Huang, Gang Yu, and Chi Zhang. Vistorybench: Comprehensive benchmark suite for story visualization. URL https://arxiv.org/abs/2505.24862.

Appendix Overview

This appendix provides additional details for the related work, generation pipeline, benchmark construction, evaluation protocol, implementation settings, prompts, and responsible-use discussion.

• Appendix A: Broader Impacts. Potential positive impacts on creative access and production cost, as wel...
appears to rely on one-shot LLM expansion, leading to weak hooks and brittle narrative logic. On the visual side, keyframes are generated independently from a few reference images, causing spatial drift and inconsistent character placement across clips. They also depend on manual inspection for quality control, and neither model scene-level audio or trans...
Category / Subcategory / Specific Topics

• After Being Abandoned at the Wedding, She Returned as an Investor Capable of Buying the Groom’s Empire
• The Day the Divorce Papers Were Signed, His Ex-Wife’s Company Went Public
• The Daughter-in-Law Kicked Out by Her Mother-in-Law Became the New Owner of Her Company Three Years Later

Hidden Identity
• The Humiliated Stable Boy Turns Out to Be the Long-Lost Heir to the Kingdom
• The Security Guard Everyone in the Company Looks Down On Has Five World Leaders’ Private Numbers in His Phone
• The Transfer Student Mocked by Classmates Whose Father Is Their School’s Chairman of the Board

Career Comeback
• The designer’s wife, who was accused of stealing her boss’s manuscript, was confronted by her husband who had been secretly married. With a single phone call, the CEO was summoned
• The Intern Publicly Humiliated by the Director Was Sitting in the Director’s Chair a Year Later
• The Designer Reduced to Tears by a Client Went On to Win an International Design Award

Social Realism / Workplace Injustice
• The Woman Fired for Being Pregnant Returned as the Company’s Biggest Client
• The Middle Manager ‘Optimized Out’ at 35 Built a Startup Team from an Unemployment Group Chat
• The Engineer Forced to Sign a Non-Compete Discovered the Boss Had Already Violated His Own

Medical & Survival
• The Night the Hospital Refused Her Surgery, She Livestreamed Everything
• Her Father’s Life-Saving Pill Costs 700 Yuan Each, So the Daughter Went to India to Find the Manufacturer Herself
• In the Three Months She Was Misdiagnosed with Cancer, She Saw Everyone Around Her for Who They Really Are

Family Ethics
• When the Mother Who Favored Sons Over Daughters Fell Ill, Only the Neglected Daughter Came
• The Parents Gave the House to Their Son but Left the Debt to Their Daughter
• The Whole Family Pooled Money for the Brother to Study Abroad, but the Sister Got Into a Better School on Her Own

Ancient Court Intrigue / Harem Power Struggle
• The Abandoned Consort in the Cold Palace Is Determined to Put the Crown Prince on the Throne
• Sentenced to Death on Her First Day in the Palace, She Traded a Bowl of Poison for the Empress’s Secret
• She Pretended to Be Out of Favor for Three Years While Secretly Building a Shadow Guard That Answers Only to Her

Court Conspiracy
• The Poisoned Princess Married the Enemy Prince Only to Burn the Empire from Within
• Everyone Believed the Chancellor Was Loyal - Only the Crown Prince Knew He Killed the Late Emperor
• The Exiled General’s Daughter Returns with Her Father’s Former Army

Women Breaking the Rules
• She Disguised Herself as a Man to Top the Imperial Exam, Only to Be Exposed in the Golden Hall
• The Princess Who Knew No Martial Arts Talked Down a Hundred Thousand Rebels with Words Alone


[1] Dramaland short drama creator service platform. https://www.dramaland.com/, 2026. Accessed: 2026-05-06. Public platform quotation for Hongguo short-drama production tiers: A-level 2000 CNY/min, S-level 3000 CNY/min, and S+-level 5000 CNY/min.

[2] Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, and Tian Xie. Onestory: Coherent multi-shot video generation with adaptive memory. URL https://arxiv.org/abs/2512.07802.

[4] Anthropic. Claude Opus 4.6 System Card. https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf, February 2026. System card.

[5] Yuval Atzmon, Rinon Gal, Yoad Tewel, Yoni Kasten, and Gal Chechik. Multi-shot character consistency for text-to-video generation. arXiv preprint arXiv:2412.07750, 2024.

[6] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... Qwen3-VL Technical Report, 2025.

[7] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.

[8] ByteDance. Xiao yun que ai agent. https://xyq.jianying.com, 2026. Closed-source commercial product built on Seedance 2.0. Accessed: 2026-04-22.

[9] ByteDance Seed Team. Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity. https://seed.bytedance.com/en/seed2, 2026. Model card.

[10] Gengchen Cao, Tianke He, Yixuan Liu, and RAY LC. Audience in the loop: Viewer feedback-driven content creation in micro-drama production on social media. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, pages 1–25, 2026.

[11] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ... SAM 3: Segment Anything with Concepts, 2026.

[12] Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model...

[13] Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation, 2025. URL https://arxiv.org/abs/2510.02283.

[15] Mohamed Elmoghany, Liangbing Zhao, Xiaoqian Shen, Subhojyoti Mukherjee, Yang Zhou, Gang Wu, Viet Dac Lai, Seunghyun Yoon, Ryan Rossi, Abdullah Rashwan, Puneet Mathur, Varun Manjunatha, Daksh Dangi, Chien Nguyen, Nedim Lipka, Trung Bui, Krishna Kumar Singh, Ruiyi Zhang, Xiaolei Huang, Jaemin Cho, Yu Wang, Namyong Park, Zhengzhong Tu, Hongjie Chen, Hoda Eld... Infinitystory: Unlimited video generation with world consistency and character-aware shot transitions. arXiv preprint arXiv:2603.03646, 2025.

[16] Google AI for Developers. Gemini 3 pro image preview. https://ai.google.dev/gemini-api/docs/models/gemini-3-pro-image-preview, 2026. Accessed: 2026-04-24.

[17] Google DeepMind. Veo 3 technical report. https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf, 2025. Technical report.

[18] Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, and Dahua Lin. End-to-end training for autoregressive video diffusion via self-resampling, 2025. URL https://arxiv.org/abs/2512.15702.

[19] Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17281–17291, 2025.

[20] HBAI Ltd. Toonflow. https://github.com/HBAI-Ltd/Toonflow-app, 2026. Open-source project under AGPL-3.0 license. Accessed: 2026-04-22.

[21] Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, and Xiaodan Liang. Storyagent: Customized storytelling video generation via multi-agent collaboration. arXiv preprint arXiv:2411.04925, 2024.

[22] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.

[23] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023. URL https://arxiv.org/abs/2311.17982.

[24] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ... Hunyuanvideo: A systematic framework for large video generative models. URL https://arxiv.org/abs/2412.03603.

[26] Kuaishou Technology. Kling ai. https://klingai.com, 2024. Accessed: 2026-04-22.

[27] Haodong Li, Shaoteng Liu, Zhe Lin, and Manmohan Chandraker. Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion, 2026. URL https://arxiv.org/abs/2602.07775.

[28] Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning. arXiv preprint arXiv:2309.15091, 2023.

[29] Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. Videostudio: Generating consistent-content and multi-scene videos. In European Conference on Computer Vision, pages 468–485. Springer, 2024.

[30] Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, and Huamin Qu. Holocine: Holistic generation of cinematic multi-shot long video narratives, 2025. URL https://arxiv.org/abs/2510.20822.

[31] Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, and Linus. The script is all you need: An agentic framework for long-horizon dialogue-to-cinematic video generation, 2026. URL https://arxiv.org/abs/2601.17737.

[32] OpenAI. GPT-Audio API Documentation, 2026. URL https://platform.openai.com/docs/models/gpt-audio. Accessed: 2026-04-30.

[33] Shiya Pi. Intensifying competition in the short-drama market poses challenges for long-form video platforms. Sina Finance, March 2025. URL https://finance.sina.com.cn/roll/2025-03-06/doc-inensrzt1029804.shtml. Accessed: 2026-05-06. The article reports that common short-drama production costs are about 10,000 CNY per minute.

[34] Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hon...

[35] Yufei Shi, Weilong Yan, Gang Xu, Yumeng Li, Yucheng Chen, Zhenxi Li, Fei Yu, Ming Li, and Si Yong Yeo. Pvchat: Personalized video chat with one-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 23321–23331, October 2025.

[35] [36]

Skyreels v1: Human-centric video foundation model

SkyReels-AI. Skyreels v1: Human-centric video foundation model. https://github.com/ SkyworkAI/SkyReels-V1, 2025

work page 2025

[36] [37]

Training diffusion language models for black-box optimization.arXiv preprint arXiv:2603.17919, 2026

Zipeng Sun, Can Chen, Ye Yuan, Haolun Wu, Jiayao Gu, Christopher Pal, and Xue Liu. Training diffusion language models for black-box optimization.arXiv preprint arXiv:2603.17919, 2026. 13

work page internal anchor Pith review arXiv 2026

[37] [38]

Qwen3.5-omni technical report, 2026

Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604. 15804

work page 2026

[38] [39]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [40]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

work page 2025

[40] [41]

Continuous 3d perception model with persistent state

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025

work page 2025

[41] [42]

Marble: A multimodal world model

World Labs. Marble: A multimodal world model. https://www.worldlabs.ai/blog/ marble-world-model, 2026. Accessed: 2026-04-24

work page 2026

[42] [43]

Automated Movie Generation via Multi-Agent CoT Planning

Weijia Wu, Zeyu Zhu, and Mike Zheng Shou. Automated movie generation via multi-agent cot planning.arXiv preprint arXiv:2503.07314, 2025

work page Pith review arXiv 2025

[43] [44]

Scieducator: Scientific video understanding and educating via deming-cycle multi-agent system.arXiv preprint arXiv:2511.17943, 2025

Zhiyu Xu, Weilong Yan, Yufei Shi, Xin Meng, Tao He, Huiping Zhuang, Ming Li, and Hehe Fan. Scieducator: Scientific video understanding and educating via deming-cycle multi-agent system.arXiv preprint arXiv:2511.17943, 2025

work page arXiv 2025

[44] [45]

Tan, Bing Zeng, and Shuaicheng Liu

Weilong Yan, Robby T. Tan, Bing Zeng, and Shuaicheng Liu. Deep homography mixture for single image rolling shutter correction. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9868–9877, October 2023

work page 2023

[45] [46]

Weilong Yan, Ming Li, Haipeng Li, Shuwei Shao, and Robby T. Tan. Synthetic-to-real self- supervised robust depth estimation via learning with motion and structure priors. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 21880–21890, June 2025

work page 2025

[46] [47]

LaS- Comp: Zero-shot 3D completion with latent-spatial consistency.arXiv preprint arXiv:2602.18735, 2026

Weilong Yan, Haipeng Li, Hao Xu, Nianjin Ye, Yihao Ai, Shuaicheng Liu, and Jingyu Hu. LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency.arXiv preprint arXiv:2602.18735, 2026

work page arXiv 2026

[47] [48]

LongLive: Real-time Interactive Long Video Generation

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [49]

arXiv preprint arXiv:2602.15989 (2026)

Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, and Kris Kitani. Sam 3d body: Robust full-body human mesh recovery, 2026. URL https://arxiv.org/abs/2602.15989

work page arXiv 2026

[49] [50]

Support-Proximity Augmented Diffusion Estimation for Offline Black-Box Optimization

Yonghan Yang, Ye Yuan, Zipeng Sun, Linfeng Du, Bowei He, Haolun Wu, Can Chen, and Xue Liu. Support-proximity augmented diffusion estimation for offline black-box optimization. arXiv preprint arXiv:2605.11246, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[50] [51]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer, 2025. URLhttps://arxiv.org/abs/2408.06072. 14

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [52]

Freeman, Frédo Durand, Eli Shechtman, and Xun Huang

Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Frédo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025

[52] [53]

Storymem: Multi-shot long video storytelling with memory

Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, and Xingang Pan. Storymem: Multi-shot long video storytelling with memory.arXiv preprint arXiv:2512.19539, 2025

work page arXiv 2025

[53] [54]

4dpc2hat: Towards dynamic point cloud understanding with failure-aware bootstrapping.arXiv preprint arXiv:2602.03890, 2026

Xindan Zhang, Weilong Yan, Yufei Shi, Xuerui Qiu, Tao He, Ying Li, Ming Li, and Hehe Fan. 4dpc2hat: Towards dynamic point cloud understanding with failure-aware bootstrapping.arXiv preprint arXiv:2602.03890, 2026

work page internal anchor Pith review arXiv 2026

[54] [55]

Relax forcing: Relaxed kv-memory for consistent long video generation.arXiv preprint arXiv:2603.21366, 2026

Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, and Ioannis Patras. Relax forcing: Relaxed kv-memory for consistent long video generation, 2026. URL https:// arxiv.org/abs/2603.21366

work page arXiv 2026

[55] [56]

arXiv preprint arXiv:2412.02259 (2024) 5, 14, 17, 19

Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, and Ser-Nam Lim. Videogen-of-thought: Step-by-step generating multi-shot video with minimal manual intervention, 2025. URL https://arxiv.org/abs/2412.02259

work page arXiv 2025

[56] [57]

Storydiffusion: Consistent self-attention for long-range image and video generation.Advances in Neural Information Processing Systems, 37:110315–110340, 2024

Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self-attention for long-range image and video generation.Advances in Neural Information Processing Systems, 37:110315–110340, 2024

work page 2024

[57] [58]

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation, 2026. URLhttps://arxiv.org/abs/2602.02214

work page internal anchor Pith review Pith/arXiv arXiv 2026

[58] [59]

Vistorybench: Comprehensive benchmark suite for story visualization,

Cailin Zhuang, Ailin Huang, Yaoqi Hu, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Zhewei Huang, Gang Yu, and Chi Zhang. Vistorybench: Comprehensive benchmark suite for story visualization,

work page

[59] [60]

URLhttps://arxiv.org/abs/2505.24862. 15 Appendix Overview This appendix provides additional details for the related work, generation pipeline, benchmark construction, evaluation protocol, implementation settings, prompts, and responsible-use discussion. • Appendix A: Broader Impacts.Potential positive impacts on creative access and production cost, as wel...

work page arXiv

[60] [61]

On the visual side, keyframes are generated independently from a few reference images, causing spatial drift and inconsistent character placement across clips

appears to rely on one-shot LLM expansion, leading to weak hooks and brittle narrative logic. On the visual side, keyframes are generated independently from a few reference images, causing spatial drift and inconsistent character placement across clips. They also depend on manual inspection for quality control, and neither model scene-level audio or trans...

work page

[61] [62]

After Being Abandoned at the Wedding, She Returned as an Investor Capable of Buying the Groom’s Empire 25 Category Subcategory Specific Topics

work page

[62] [63]

The Day the Divorce Papers Were Signed, His Ex-Wife’s Company Went Public

work page

[63] [64]

The Humiliated Stable Boy Turns Out to Be the Long-Lost Heir to the Kingdom

The Daughter-in-Law Kicked Out by Her Mother-in-Law Became the New Owner of Her Company Three Years Later Hidden Identity 1. The Humiliated Stable Boy Turns Out to Be the Long-Lost Heir to the Kingdom

work page

[64] [65]

The Security Guard Everyone in the Company Looks Down On Has Five World Leaders’ Private Numbers in His Phone

work page

[65] [66]

The designer’s wife, who was accused of steal- ing her boss’s manuscript, was confronted by her husband who had been secretly married

The Transfer Student Mocked by Classmates Whose Father Is Their School’s Chairman of the Board Career Comeback 1. The designer’s wife, who was accused of steal- ing her boss’s manuscript, was confronted by her husband who had been secretly married. With a single phone call, the CEO was summoned

work page

[66] [67]

The Intern Publicly Humiliated by the Director Was Sitting in the Director’s Chair a Year Later

work page

[67] [68]

The Woman Fired for Being Pregnant Returned as the Company’s Biggest Client

The Designer Reduced to Tears by a Client Went On to Win an International Design Award Social Realism Workplace Injustice 1. The Woman Fired for Being Pregnant Returned as the Company’s Biggest Client

[68] [69]

The Middle Manager ‘Optimized Out’ at 35 Built a Startup Team from an Unemployment Group Chat

[69] [70]

The Night the Hospital Refused Her Surgery, She Livestreamed Everything

The Engineer Forced to Sign a Non-Compete Discovered the Boss Had Already Violated His Own

Subcategory: Medical & Survival

1. The Night the Hospital Refused Her Surgery, She Livestreamed Everything

[70] [71]

Her Father’s Life-Saving Pill Costs 700 Yuan Each, So the Daughter Went to India to Find the Manufacturer Herself

[71] [72]

When the Mother Who Favored Sons Over Daughters Fell Ill, Only the Neglected Daughter Came

In the Three Months She Was Misdiagnosed with Cancer, She Saw Everyone Around Her for Who They Really Are

Subcategory: Family Ethics

1. When the Mother Who Favored Sons Over Daughters Fell Ill, Only the Neglected Daughter Came

[72] [73]

The Parents Gave the House to Their Son but Left the Debt to Their Daughter

[73] [74]

The Whole Family Pooled Money for the Brother to Study Abroad, but the Sister Got Into a Better School on Her Own

Category: Ancient Court Intrigue; Subcategory: Harem Power Struggle

[74] [75]

The Abandoned Consort in the Cold Palace Is Determined to Put the Crown Prince on the Throne

[75] [76]

Sentenced to Death on Her First Day in the Palace, She Traded a Bowl of Poison for the Empress’s Secret

[76] [77]

The Poisoned Princess Married the Enemy Prince Only to Burn the Empire from Within

She Pretended to Be Out of Favor for Three Years While Secretly Building a Shadow Guard That Answers Only to Her

Subcategory: Court Conspiracy

1. The Poisoned Princess Married the Enemy Prince Only to Burn the Empire from Within

[77] [78]

Everyone Believed the Chancellor Was Loyal — Only the Crown Prince Knew He Killed the Late Emperor

[78] [79]

The Exiled General’s Daughter Returns with Her Father’s Former Army

Subcategory: Women Breaking the Rules

[79] [80]

She Disguised Herself as a Man to Top the Imperial Exam, Only to Be Exposed in the Golden Hall

[80] [81]

The Princess Who Knew No Martial Arts Talked Down a Hundred Thousand Rebels with Words Alone
