DramaDirector: Geometry-Guided Short Drama Generation

Hengji Zhou; Jianrun Chen; Lianghao Xia; Liqiang Nie; Sijie Liu; Xingchen Zou

arxiv: 2606.24107 · v1 · pith:FO5HC7C6new · submitted 2026-06-23 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-26 01:55 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords short drama generation · geometry-guided video · depth pose retrieval · multi-shot synthesis · cinematographic grounding · plot to video · DramaBoard benchmark · visual consistency

The pith

DramaDirector retrieves depth and pose from real short-drama shots to guide multi-shot video generation from plots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework that generates short dramas from plots by borrowing geometric references from a gallery of real footage. It decouples each shot into static visual conditions and dynamic narrative conditions, then retrieves depth-pose references to guide generation. The approach targets the rapid shot changes and cinematographic demands that text-only methods struggle to meet. The authors also build a benchmark from live-action dramas to evaluate faithfulness, consistency, and controllability. The work matters to readers who want AI-generated narrative video grounded in real visual geometry.

Core claim

DramaDirector is a geometry-grounded framework that retrieves depth-pose references from real short-drama shots to guide first-frame generation and image-to-video synthesis, trained with schema-constrained SFT and GRPO under a text-visual alignment reward, leading to improved performance on faithfulness, consistency, and controllability over baselines.

What carries the argument

Retrieval of depth-pose references from a gallery of real short-drama shots indexed by depth and pose. The references guide generation while static visual conditions and dynamic narrative conditions stay decoupled.
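A minimal sketch of what gallery retrieval of this kind could look like, assuming each shot is represented by a single depth-pose embedding vector and that matching uses cosine similarity. The embedding values, the shot ids, and the `retrieve` helper are illustrative assumptions; the paper's actual index and similarity measure are not described in the text reviewed here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, gallery, k=2):
    """Return the k gallery shots most similar to the query embedding."""
    scored = [(cosine(query, emb), shot_id) for shot_id, emb in gallery.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(shot_id, score) for score, shot_id in scored[:k]]


# Hypothetical gallery: shot ids mapped to made-up depth-pose embeddings.
gallery = {
    "closeup_two_person": [0.9, 0.1, 0.0],
    "wide_street": [0.0, 0.2, 0.9],
    "over_shoulder": [0.7, 0.6, 0.1],
}

# Hypothetical query embedding derived from a planned shot description.
query = [1.0, 0.2, 0.0]
top = retrieve(query, gallery, k=2)
for shot_id, score in top:
    print(f"{shot_id}: {score:.3f}")
```

In a real system the brute-force scan would likely be replaced by an approximate nearest-neighbor index, but the ranking logic stays the same.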

If this is right

Short drama generation can achieve higher visual grounding by referencing real cinematography.
Multi-agent and video generation baselines can be outperformed by incorporating geometric retrieval.
Structured storyboards in benchmarks like DramaBoard enable better evaluation of narrative video generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Larger galleries of indexed shots could further improve reference matching for diverse plots.
This retrieval method might apply to other domains requiring geometric consistency, such as animation or virtual production.

Load-bearing premise

That a gallery of real short-drama shots will contain sufficiently relevant depth and pose references for arbitrary new plots, allowing reliable transfer without mismatches.

What would settle it

Generating dramas for plots whose required camera geometries are absent from the real-shot gallery and observing whether generation quality degrades significantly compared to in-gallery cases.

Figures

Figures reproduced from arXiv: 2606.24107 by Hengji Zhou, Jianrun Chen, Lianghao Xia, Liqiang Nie, Sijie Liu, Xingchen Zou.

**Figure 2.** Overall architecture of the proposed DramaDirector framework for short drama generation.

**Figure 3.** Tuning Performance of different ablations.

**Figure 5.** The end-to-end short drama generation process.

**Figure 6.** Hyperparameter study on GRPO steps and 'Static' shots.

Accompanying text: Rather than relying on raw text, Step 3 retrieves spatial layouts and body configurations aligned with the generated script. This retrieval-augmented grounding informs the video generation phase (Step 4), ensuring the final narrative faithfully reflects specified actions while maintaining visual aesthetics and temporal consistency. …

**Figure 7.** Comparison of generation workflows between DramaDirector, image-to-video, and text-to-video.

**Figure 8.** Balancing Negative and Hard Negative Losses.

**Figure 9.** Episode-level semantic annotation prompt.

**Figure 10.** Shot-level transcript correction prompt.

Accompanying text: Shot-level storyboard annotation. Given the keyframe, shot duration, corrected transcript, and episode-level context, we prompt a multimodal model to produce one structured storyboard object per shot, as shown in Figure 14. Each object includes camera fields, subjects, background, narrative, dialogue, speaker, emotion, and duration. The prompt enforces one-to-one frame-to…

**Figure 11.** Conditioning instructions appended to the …

**Figure 12.** Shot-Level Human Evaluation Rubric.

Accompanying text: A.7 Human Evaluation. To complement the automatic judge-based evaluation, we further conduct a human study to directly assess the perceptual quality of generated short-drama videos. We randomly sample 30 generated video cases from the evaluation set, where each case contains 4–5 storyboard shots. For DramaDirector and three representative baselines, ShoulderShot, GenMa…

**Figure 13.** Human evaluation results across six metrics.

**Figure 14.** Shot-level storyboard annotation prompt.

Prompt excerpt: System: You are a professional film and television storyboard scriptwriting assistant. Based on the provided plot summary, character information, and existing storyboard descriptions, continue generating the next batch of storyboard descriptions. [Output Format] Strictly output a JSON array, where each element is a storyboard object. The fields should follow this te…

**Figure 15.** An example SFT prompt for storyboard continuation. Long contexts are truncated for readability.

**Figure 16.** Detailed evaluation prompt and rubric used for assessing static image generation quality.

**Figure 17.** Detailed evaluation prompt and rubric used for assessing temporal and dynamic video generation quality.
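The captions for Figures 10 and 14 describe storyboard output as a JSON array of objects with camera fields, subjects, background, narrative, dialogue, speaker, emotion, and duration. The sketch below shows one way such output could be checked against a schema before use. The exact field names and types are assumptions made for illustration, since the paper's template is truncated in the captions above.

```python
import json

# Assumed field names and types, loosely based on the caption text.
REQUIRED_FIELDS = {
    "camera": dict,
    "subjects": list,
    "background": str,
    "narrative": str,
    "dialogue": str,
    "speaker": str,
    "emotion": str,
    "duration": (int, float),
}


def validate_storyboard(raw):
    """Parse a JSON array of shots and return a list of error strings."""
    try:
        shots = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(shots, list):
        return ["top-level value must be a JSON array"]
    errors = []
    for i, shot in enumerate(shots):
        if not isinstance(shot, dict):
            errors.append(f"shot {i}: not an object")
            continue
        for name, expected in REQUIRED_FIELDS.items():
            if name not in shot:
                errors.append(f"shot {i}: missing '{name}'")
            elif not isinstance(shot[name], expected):
                errors.append(f"shot {i}: '{name}' has wrong type")
    return errors


# Hypothetical planner outputs: one well-formed, one malformed.
good = json.dumps([{
    "camera": {"shot_size": "close-up"},
    "subjects": ["A"],
    "background": "office",
    "narrative": "A looks up from the desk.",
    "dialogue": "",
    "speaker": "",
    "emotion": "tense",
    "duration": 2.5,
}])
bad = json.dumps([{"camera": {}, "subjects": "A"}])

print(validate_storyboard(good))
print(validate_storyboard(bad))
```

A check like this could also serve as a format gate during training, rejecting outputs that do not parse before any reward is computed, though whether the paper does so is not stated.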

Original abstract

Short dramas, with their rapid shot rhythms, dialogue-driven focus shifts, and demanding cinematographic grounding, pose challenges that prompt-level or text-only video generation pipelines struggle to meet. We study plot-to-short-drama generation, where a global plot and local context are transformed into visually grounded multi-shot videos. We propose DramaDirector, a geometry-grounded framework that lets the planner borrow cinematographic geometry from a gallery of real short-drama shots indexed by depth and pose. DramaDirector decouples each shot into static visual and dynamic narrative conditions, trains the planner with schema-constrained SFT and GRPO under a learned text-visual alignment reward, and retrieves depth-pose references to guide first-frame generation and image-to-video synthesis. We also introduce DramaBoard, a benchmark built from 35 live-action dramas, 2.8K episodes, and 81K shots, with structured storyboards and multi-dimensional evaluation protocols. Experiments show that DramaDirector improves over representative multi-agent and video generation baselines on faithfulness, consistency, and controllability. Our code is released at: https://github.com/iLearn-Lab/DramaDirector
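The abstract mentions GRPO under a learned text-visual alignment reward without spelling out the update. GRPO is commonly described as scoring several sampled outputs for the same prompt and normalizing each reward by the mean and standard deviation of its group. The sketch below shows only that normalization step. The reward values are invented, and the use of the population standard deviation is an assumption, not a detail taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical alignment rewards for four storyboards sampled from one plot.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Samples that score above their group's mean receive positive advantages and are reinforced. Samples below it are pushed down, which removes the need for a separate value model.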

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DramaDirector adds a retrieval step that draws depth-pose guidance from a 35-drama gallery for plot-to-short-drama video generation, and it introduces a new benchmark. The abstract supplies no numbers or retrieval diagnostics, so the claimed gains remain unverified.

The main point is that this paper takes a retrieval approach to ground short-drama generation in real cinematographic geometry instead of text alone, and it ships a benchmark built from 35 live-action dramas.

It does a few things cleanly. The split between static visual conditions and dynamic narrative conditions is a reasonable way to handle the rapid shot changes and dialogue focus in short dramas. Using schema-constrained SFT plus GRPO with a learned text-visual reward gives the planner some structure. DramaBoard itself, with its 2.8K episodes and 81K shots plus structured storyboards, is a concrete resource that the field can actually use.

The soft spot is the missing evidence on whether the retrieval step works. The gallery is fixed at 35 dramas, so coverage for arbitrary new plots is an open question. The abstract claims better faithfulness, consistency, and controllability over baselines but gives no scores, no error bars, no ablation on retrieval success rate, and no failure cases. Without those, it is impossible to tell if the geometry references are actually carrying the improvement or if poor matches are just adding noise.

This is for people working on controllable multi-shot video synthesis, especially anyone who needs story-level benchmarks. A reader who wants to test retrieval-augmented pipelines on narrative content would find the benchmark and the overall framing useful.

It deserves peer review so the full experiments and retrieval statistics can be checked.

Referee Report

2 major / 2 minor

Summary. The paper introduces DramaDirector, a geometry-grounded framework for plot-to-short-drama generation. It decouples each shot into static visual and dynamic narrative conditions, retrieves depth-pose references from a gallery of real short-drama shots (built from 35 dramas and 81K shots) to condition first-frame generation and I2V synthesis, trains a planner via schema-constrained SFT and GRPO with a learned text-visual alignment reward, and introduces the DramaBoard benchmark with structured storyboards and multi-dimensional evaluation. The central claim is that this yields improvements over representative multi-agent and video generation baselines on faithfulness, consistency, and controllability.

Significance. If the retrieval-based geometry guidance proves reliable, the work could offer a practical route to cinematographically grounded multi-shot video generation that goes beyond text-only conditioning. The open release of code and the introduction of a benchmark with explicit evaluation protocols are concrete strengths that support reproducibility and future comparisons in narrative video synthesis.

major comments (2)

[Framework description and retrieval component] The central claim depends on the retrieval step (described in the framework overview) reliably surfacing geometrically compatible depth-pose references from the fixed 35-drama gallery for arbitrary new plots. No coverage statistics, retrieval success rates, similarity thresholds, or failure-mode analysis are supplied, so it is impossible to verify whether poor matches degrade the geometry guidance into noisy conditioning.
[Abstract and Experiments section] The abstract states that experiments demonstrate improvements on faithfulness, consistency, and controllability, yet supplies no quantitative metrics, error bars, dataset splits, or ablation results on the retrieval component. This absence prevents assessment of whether the reported gains are attributable to the geometry guidance or to other factors.
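The diagnostics requested in these comments can be sketched in a few lines. The example assumes that each test query has a best-match similarity against the gallery and a downstream quality score. The threshold, the records, and the helper names are all hypothetical and chosen only to illustrate the analysis.

```python
def retrieval_coverage(best_similarities, threshold):
    """Fraction of queries whose best gallery match reaches the threshold."""
    if not best_similarities:
        return 0.0
    hits = sum(1 for s in best_similarities if s >= threshold)
    return hits / len(best_similarities)


def split_by_coverage(records, threshold):
    """Split quality scores by whether retrieval found a close match."""
    covered = [q for s, q in records if s >= threshold]
    uncovered = [q for s, q in records if s < threshold]
    return covered, uncovered


def mean(values):
    return sum(values) / len(values) if values else float("nan")


# Invented records: (best similarity to gallery, downstream quality score).
records = [(0.92, 4.1), (0.85, 3.9), (0.40, 3.0), (0.35, 2.8)]
threshold = 0.8  # assumed cut-off for a usable match

coverage = retrieval_coverage([s for s, _ in records], threshold)
covered, uncovered = split_by_coverage(records, threshold)
gap = mean(covered) - mean(uncovered)
print(f"coverage={coverage:.2f} quality gap={gap:.2f}")
```

A large quality gap between covered and uncovered queries would suggest that generation depends heavily on gallery coverage. A small gap would suggest that the geometry references contribute less than claimed.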

minor comments (2)

[Abstract] Acronyms such as SFT and GRPO are used without expansion on first appearance in the abstract and method description.
[DramaBoard benchmark description] The benchmark construction details (2.8K episodes, 81K shots) would benefit from an explicit statement of how the 35 source dramas were selected to ensure diversity of plots and cinematographic styles.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional analysis and clarifications.

Referee: [Framework description and retrieval component] The central claim depends on the retrieval step (described in the framework overview) reliably surfacing geometrically compatible depth-pose references from the fixed 35-drama gallery for arbitrary new plots. No coverage statistics, retrieval success rates, similarity thresholds, or failure-mode analysis are supplied, so it is impossible to verify whether poor matches degrade the geometry guidance into noisy conditioning.

Authors: We agree that quantitative characterization of the retrieval step is needed to substantiate its reliability. The revised manuscript will add coverage statistics over the 81K-shot gallery, retrieval success rates computed with explicit similarity thresholds on depth-pose features, and a failure-mode analysis that quantifies the impact of poor matches on downstream generation quality. revision: yes
Referee: [Abstract and Experiments section] The abstract states that experiments demonstrate improvements on faithfulness, consistency, and controllability, yet supplies no quantitative metrics, error bars, dataset splits, or ablation results on the retrieval component. This absence prevents assessment of whether the reported gains are attributable to the geometry guidance or to other factors.

Authors: The experiments section already reports quantitative metrics, dataset construction details, and baseline comparisons; however, we acknowledge the absence of retrieval-specific ablations and the lack of numeric results in the abstract. We will revise the abstract to include key quantitative improvements with error bars and add explicit ablation studies isolating the retrieval component in the experiments section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with external benchmark and released code

The paper presents an applied system (DramaDirector) that retrieves depth-pose references from a fixed gallery of 81K shots drawn from 35 dramas, trains a planner via schema-constrained SFT + GRPO with a learned reward, and evaluates on the introduced DramaBoard benchmark. No equations, fitted parameters renamed as predictions, or derivation chains appear in the provided text. Claims of improved faithfulness/consistency/controllability rest on experimental comparison against baselines rather than any self-referential reduction. The gallery construction, retrieval mechanism, and benchmark are described as independent artifacts; the code release further supports external verification. This matches the default case of a self-contained empirical contribution with no load-bearing self-citation or definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no equations, parameters, or background assumptions are detailed enough to populate the ledger.

pith-pipeline@v0.9.1-grok · 5738 in / 1084 out tokens · 29387 ms · 2026-06-26T01:55:29.502701+00:00 · methodology

Reference graph

Works this paper leans on

67 extracted references · 14 linked inside Pith

[1]

Advances in Neural Information Processing Systems , volume=

Comfymind: Toward general-purpose generation via tree-based planning and reactive feedback , author=. Advances in Neural Information Processing Systems , volume=
[2]

arXiv preprint arXiv:2603.28767 , year=

Gen-searcher: Reinforcing agentic search for image generation , author=. arXiv preprint arXiv:2603.28767 , year=

Pith/arXiv arXiv
[3]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Vita: An efficient video-to-text algorithm using vlm for rag-based video analysis system , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[4]

International Conference on Learning Representations , volume=

T2v-turbo-v2: Enhancing video model post-training through data, reward, and conditional guidance design , author=. International Conference on Learning Representations , volume=
[7]

Proceedings of the 29th International Conference on Computational Linguistics , pages=

Of human criteria and automatic metrics: A benchmark of the evaluation of story generation , author=. Proceedings of the 29th International Conference on Computational Linguistics , pages=
[8]

arXiv preprint arXiv:2510.12323 , year=

Rag-anything: All-in-one rag framework , author=. arXiv preprint arXiv:2510.12323 , year=

arXiv
[9]

arXiv preprint arXiv:2508.07597 , year=

ShoulderShot: Generating Over-the-Shoulder Dialogue Videos , author=. arXiv preprint arXiv:2508.07597 , year=

arXiv
[10]

arXiv preprint arXiv:2504.02436 , year=

Skyreels-a2: Compose anything in video diffusion trans formers , author=. arXiv preprint arXiv:2504.02436 , year=

arXiv
[11]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Dreamrunner: Fine-grained compositional story-to-video generation with retrieval-augmented motion adaptation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[12]

Proceedings of the computer vision and pattern recognition conference , pages=

Comfybench: Benchmarking llm-based agents in comfyui for autonomously designing collaborative ai systems , author=. Proceedings of the computer vision and pattern recognition conference , pages=
[13]

2026 , howpublished =

Wan2.6 , author =. 2026 , howpublished =

2026
[14]

2026 , howpublished =

Vidu Q3 Turbo , author =. 2026 , howpublished =

2026
[15]

2026 , howpublished =

Nano Banana 2: Gemini AI Image Generator and Photo Editor , author =. 2026 , howpublished =

2026
[16]

2026 , howpublished =

Seedream 5.0 Lite , author =. 2026 , howpublished =

2026
[17]

2026 , howpublished =

Multimodal Embedding API Reference , author =. 2026 , howpublished =

2026
[18]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Genmac: compositional text-to-video generation with multi-agent collaboration , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[20]

arXiv preprint arXiv:2511.08521 , year=

UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist , author=. arXiv preprint arXiv:2511.08521 , year=

arXiv
[21]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Videoauteur: Towards long narrative video generation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[22]

arXiv preprint arXiv:2410.05779 , volume=

Lightrag: Simple and fast retrieval-augmented generation , author=. arXiv preprint arXiv:2410.05779 , volume=

Pith/arXiv arXiv
[23]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Patho-AgenticRAG: towards multimodal agentic retrieval-augmented generation for pathology VLMs via reinforcement learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[24]

Tongyi-Embedding-Vision: Multimodal Embedding API , year =
[25]

arXiv preprint arXiv:2311.15127 , year =

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets , author =. arXiv preprint arXiv:2311.15127 , year =

Pith/arXiv arXiv
[26]

arXiv preprint arXiv:2410.13720 , year =

Movie Gen: A Cast of Media Foundation Models , author =. arXiv preprint arXiv:2410.13720 , year =

Pith/arXiv arXiv
[27]

arXiv preprint arXiv:2311.04145 , year =

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models , author =. arXiv preprint arXiv:2311.04145 , year =

Pith/arXiv arXiv
[28]

arXiv preprint arXiv:2209.14958 , year =

Co-Writing Screenplays and Theatre Scripts with Language Models: An Evaluation by Industry Professionals , author =. arXiv preprint arXiv:2209.14958 , year =

arXiv
[29]

arXiv preprint arXiv:2309.15091 , year =

VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning , author =. arXiv preprint arXiv:2309.15091 , year =

arXiv
[30]

arXiv preprint arXiv:2411.04925 , year =

StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration , author =. arXiv preprint arXiv:2411.04925 , year =

arXiv
[31]

arXiv preprint arXiv:2503.05242 , year =

MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio , author =. arXiv preprint arXiv:2503.05242 , year =

arXiv
[32]

arXiv preprint arXiv:2408.09333 , year =

SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama , author =. arXiv preprint arXiv:2408.09333 , year =

arXiv
[33]

arXiv preprint arXiv:2602.21818 , year =

SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model , author =. arXiv preprint arXiv:2602.21818 , year =

arXiv
[34]

arXiv preprint arXiv:2412.20725 , year =

Dialogue Director: Bridging the Gap in Dialogue Visualization for Multimodal Storytelling , author =. arXiv preprint arXiv:2412.20725 , year =

arXiv
[35]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =
[36]

arXiv preprint arXiv:2302.05543 , year =

Adding Conditional Control to Text-to-Image Diffusion Models , author =. arXiv preprint arXiv:2302.05543 , year =

Pith/arXiv arXiv
[37]

arXiv preprint arXiv:2311.16933 , year =

SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models , author =. arXiv preprint arXiv:2311.16933 , year =

arXiv
[38]

arXiv preprint arXiv:2309.00398 , year =

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation , author =. arXiv preprint arXiv:2309.00398 , year =

arXiv
[39]

arXiv preprint arXiv:2310.12190 , year =

DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors , author =. arXiv preprint arXiv:2310.12190 , year =

arXiv
[40]

arXiv preprint arXiv:2404.02101 , year =

CameraCtrl: Enabling Camera Control for Text-to-Video Generation , author =. arXiv preprint arXiv:2404.02101 , year =

Pith/arXiv arXiv
[41]

arXiv preprint arXiv:2410.15957 , year =

CamI2V: Camera-Controlled Image-to-Video Diffusion Model , author =. arXiv preprint arXiv:2410.15957 , year =

arXiv
[42]

Proceedings of the IEEE/CVF International Conference on Computer Vision , year =

RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision , year =
[43]

arXiv preprint arXiv:2501.12948 , year =

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , author =. arXiv preprint arXiv:2501.12948 , year =

Pith/arXiv arXiv
[44]

arXiv preprint arXiv:2503.06749 , year =

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models , author =. arXiv preprint arXiv:2503.06749 , year =

Pith/arXiv arXiv
[45]

arXiv preprint arXiv:2501.09099 , year=

Drama llama: An llm-powered storylets framework for authorable responsiveness in interactive narrative , author=. arXiv preprint arXiv:2501.09099 , year=

arXiv
[46]

arXiv preprint arXiv:2506.18899 , year=

Filmaster: Bridging cinematic principles and generative ai for automated film generation , author=. arXiv preprint arXiv:2506.18899 , year=

arXiv
[47]

Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems , pages=

Audience in the loop: Viewer feedback-driven content creation in micro-drama production on social media , author=. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems , pages=

2026
[48]

arXiv preprint arXiv:2603.02681 , year =

VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation , author =. arXiv preprint arXiv:2603.02681 , year =

arXiv
[49]

arXiv preprint arXiv:2603.08812 , year =

VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model , author =. arXiv preprint arXiv:2603.08812 , year =

arXiv
[50]

arXiv preprint arXiv:2505.24073 , year =

mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation , author =. arXiv preprint arXiv:2505.24073 , year =

arXiv
[51]

arXiv preprint arXiv:2602.09609 , year =

Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing , author =. arXiv preprint arXiv:2602.09609 , year =

arXiv
[52]

arXiv preprint arXiv:2604.09195 , year =

Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation , author =. arXiv preprint arXiv:2604.09195 , year =

Pith/arXiv arXiv
[53]

Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=

A Survey on LLMs for Story Generation , author=. Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=

2025
[54]

arXiv preprint arXiv:2507.18634 , year =

Captain Cinema: Towards Short Movie Generation , author =. arXiv preprint arXiv:2507.18634 , year =

arXiv
[55]

arXiv preprint arXiv:2510.23163 , year =

Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs , author =. arXiv preprint arXiv:2510.23163 , year =

arXiv
[56]

International Conference on Learning Representations , year =

VADER: Video Diffusion Alignment via Reward Gradients , author =. International Conference on Learning Representations , year =
[57]

arXiv preprint arXiv:2512.12372 , year =

STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative , author =. arXiv preprint arXiv:2512.12372 , year =

arXiv
[58]

arXiv preprint arXiv:2604.03315 , year =

StoryBlender: Inter-Shot Consistent and Editable 3D Storyboard with Spatial-temporal Dynamics , author =. arXiv preprint arXiv:2604.03315 , year =

Pith/arXiv arXiv
[59]

arXiv preprint arXiv:2512.07802 , year =

OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory , author =. arXiv preprint arXiv:2512.07802 , year =

arXiv
[61]

IEEE/CVF International Conference on Computer Vision (ICCV) , year=

VideoAuteur: Towards long narrative video generation , author=. IEEE/CVF International Conference on Computer Vision (ICCV) , year=
[62]

arXiv preprint arXiv:2507.00001 , year=

Long Context Tuning for multi-shot video generation , author=. arXiv preprint arXiv:2507.00001 , year=

Pith/arXiv arXiv
[63]

arXiv preprint arXiv:2503.07314 , year=

MovieBench: Hierarchical annotations for movie generation , author=. arXiv preprint arXiv:2503.07314 , year=

arXiv
[64]

International Conference on Learning Representations (ICLR) 2025 , year=

MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences , author=. International Conference on Learning Representations (ICLR) 2025 , year=

2025
[65]

arXiv preprint arXiv:2512.19539 , year=

StoryMem: Memory-Augmented Video Storytelling , author=. arXiv preprint arXiv:2512.19539 , year=

arXiv
[66]

arXiv preprint arXiv:2506.01908 , year=

Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency , author=. arXiv preprint arXiv:2506.01908 , year=

arXiv
[67]

arXiv preprint arXiv:2505.23990 , year=

Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding , author=. arXiv preprint arXiv:2505.23990 , year=

arXiv
[68]

arXiv preprint arXiv:2604.05418 , year=

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG , author=. arXiv preprint arXiv:2604.05418 , year=

Pith/arXiv arXiv
[69]

NeurIPS 2025 , year=

EchoShot: Multi-Shot Portrait Video Generation , author=. NeurIPS 2025 , year=

2025
[70]

arXiv preprint arXiv:2406.09414 , year=

Depth Anything V2 , author=. arXiv preprint arXiv:2406.09414 , year=

Pith/arXiv arXiv
[71]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Effective Whole-Body Pose Estimation with Two-Stages Distillation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[1] [1]

Advances in Neural Information Processing Systems , volume=

Comfymind: Toward general-purpose generation via tree-based planning and reactive feedback , author=. Advances in Neural Information Processing Systems , volume=

[2] [2]

arXiv preprint arXiv:2603.28767 , year=

Gen-searcher: Reinforcing agentic search for image generation , author=. arXiv preprint arXiv:2603.28767 , year=

Pith/arXiv arXiv

[3] [3]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Vita: An efficient video-to-text algorithm using vlm for rag-based video analysis system , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[4] [4]

International Conference on Learning Representations , volume=

T2v-turbo-v2: Enhancing video model post-training through data, reward, and conditional guidance design , author=. International Conference on Learning Representations , volume=

[5] [7]

Proceedings of the 29th International Conference on Computational Linguistics , pages=

Of human criteria and automatic metrics: A benchmark of the evaluation of story generation , author=. Proceedings of the 29th International Conference on Computational Linguistics , pages=

[6] [8]

arXiv preprint arXiv:2510.12323 , year=

Rag-anything: All-in-one rag framework , author=. arXiv preprint arXiv:2510.12323 , year=

arXiv

[7] [9]

arXiv preprint arXiv:2508.07597 , year=

ShoulderShot: Generating Over-the-Shoulder Dialogue Videos , author=. arXiv preprint arXiv:2508.07597 , year=

arXiv

[8] [10]

arXiv preprint arXiv:2504.02436 , year=

Skyreels-a2: Compose anything in video diffusion trans formers , author=. arXiv preprint arXiv:2504.02436 , year=

arXiv

[9] [11]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Dreamrunner: Fine-grained compositional story-to-video generation with retrieval-augmented motion adaptation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[10] [12]

Proceedings of the computer vision and pattern recognition conference , pages=

Comfybench: Benchmarking llm-based agents in comfyui for autonomously designing collaborative ai systems , author=. Proceedings of the computer vision and pattern recognition conference , pages=

[11] [13]

2026 , howpublished =

Wan2.6 , author =. 2026 , howpublished =

2026

[12] [14]

2026 , howpublished =

Vidu Q3 Turbo , author =. 2026 , howpublished =

2026

[13] [15]

2026 , howpublished =

Nano Banana 2: Gemini AI Image Generator and Photo Editor , author =. 2026 , howpublished =

2026

[14] [16]

2026 , howpublished =

Seedream 5.0 Lite , author =. 2026 , howpublished =

2026

[15] [17]

2026 , howpublished =

Multimodal Embedding API Reference , author =. 2026 , howpublished =

2026

[16] [18]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Genmac: compositional text-to-video generation with multi-agent collaboration , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[17] [20]

UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist. arXiv preprint arXiv:2511.08521.

[18] [21]

VideoAuteur: Towards long narrative video generation. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[19] [22]

LightRAG: Simple and fast retrieval-augmented generation. arXiv preprint arXiv:2410.05779.

[20] [23]

Patho-AgenticRAG: Towards multimodal agentic retrieval-augmented generation for pathology VLMs via reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence.

[21] [24]

Tongyi-Embedding-Vision: Multimodal Embedding API.

[22] [25]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. arXiv preprint arXiv:2311.15127.

[23] [26]

Movie Gen: A Cast of Media Foundation Models. arXiv preprint arXiv:2410.13720.

[24] [27]

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models. arXiv preprint arXiv:2311.04145.

[25] [28]

Co-Writing Screenplays and Theatre Scripts with Language Models: An Evaluation by Industry Professionals. arXiv preprint arXiv:2209.14958.

[26] [29]

VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning. arXiv preprint arXiv:2309.15091.

[27] [30]

StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration. arXiv preprint arXiv:2411.04925.

[28] [31]

MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio. arXiv preprint arXiv:2503.05242.

[29] [32]

SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama. arXiv preprint arXiv:2408.09333.

[30] [33]

SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model. arXiv preprint arXiv:2602.21818.

[31] [34]

Dialogue Director: Bridging the Gap in Dialogue Visualization for Multimodal Storytelling. arXiv preprint arXiv:2412.20725.

[32] [35]

ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33] [36]

Adding Conditional Control to Text-to-Image Diffusion Models. arXiv preprint arXiv:2302.05543.

[34] [37]

SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models. arXiv preprint arXiv:2311.16933.

[35] [38]

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation. arXiv preprint arXiv:2309.00398.

[36] [39]

DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors. arXiv preprint arXiv:2310.12190.

[37] [40]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation. arXiv preprint arXiv:2404.02101.

[38] [41]

CamI2V: Camera-Controlled Image-to-Video Diffusion Model. arXiv preprint arXiv:2410.15957.

[39] [42]

RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[40] [43]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

[41] [44]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. arXiv preprint arXiv:2503.06749.

[42] [45]

Drama Llama: An LLM-powered storylets framework for authorable responsiveness in interactive narrative. arXiv preprint arXiv:2501.09099.

[43] [46]

FilMaster: Bridging cinematic principles and generative AI for automated film generation. arXiv preprint arXiv:2506.18899.

[44] [47]

Audience in the loop: Viewer feedback-driven content creation in micro-drama production on social media. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems.

[45] [48]

VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation. arXiv preprint arXiv:2603.02681.

[46] [49]

VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model. arXiv preprint arXiv:2603.08812.

[47] [50]

mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation. arXiv preprint arXiv:2505.24073.

[48] [51]

Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing. arXiv preprint arXiv:2602.09609.

[49] [52]

Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation. arXiv preprint arXiv:2604.09195.

[50] [53]

A Survey on LLMs for Story Generation. Findings of the Association for Computational Linguistics: EMNLP 2025.

[51] [54]

Captain Cinema: Towards Short Movie Generation. arXiv preprint arXiv:2507.18634.

[52] [55]

Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs. arXiv preprint arXiv:2510.23163.

[53] [56]

VADER: Video Diffusion Alignment via Reward Gradients. International Conference on Learning Representations.

[54] [57]

STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative. arXiv preprint arXiv:2512.12372.

[55] [58]

StoryBlender: Inter-Shot Consistent and Editable 3D Storyboard with Spatial-temporal Dynamics. arXiv preprint arXiv:2604.03315.

[56] [59]

OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory. arXiv preprint arXiv:2512.07802.

[57] [61]

VideoAuteur: Towards long narrative video generation. IEEE/CVF International Conference on Computer Vision (ICCV).

[58] [62]

Long Context Tuning for multi-shot video generation. arXiv preprint arXiv:2507.00001.

[59] [63]

MovieBench: Hierarchical annotations for movie generation. arXiv preprint arXiv:2503.07314.

[60] [64]

MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences. International Conference on Learning Representations (ICLR) 2025.

[61] [65]

StoryMem: Memory-Augmented Video Storytelling. arXiv preprint arXiv:2512.19539.

[62] [66]

Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency. arXiv preprint arXiv:2506.01908.

[63] [67]

Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding. arXiv preprint arXiv:2505.23990.

[64] [68]

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG. arXiv preprint arXiv:2604.05418.

[65] [69]

EchoShot: Multi-Shot Portrait Video Generation. NeurIPS 2025.

[66] [70]

Depth Anything V2. arXiv preprint arXiv:2406.09414.

[67] [71]

Effective Whole-Body Pose Estimation with Two-Stages Distillation. Proceedings of the IEEE/CVF International Conference on Computer Vision.