pith. machine review for the scientific record. sign in

arxiv: 2605.15199 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

Authors on Pith no claims yet

Pith reviewed 2026-05-15 03:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords EntityBenchmulti-shot video generationentity consistencycross-shot consistencymemory-augmented generationvideo benchmarkcharacter fidelitynarrative video
0
0 comments X

The pith

Explicit per-entity memory maintains character consistency across long gaps in multi-shot video generation where existing methods fail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EntityBench, a dataset of 140 episodes and 2,491 shots drawn from real narrative media, complete with per-shot schedules for characters, objects, and locations that span easy-to-hard tiers and recurrence distances up to 48 shots. It pairs the benchmark with an evaluation protocol that separates intra-shot visual quality, prompt alignment, and cross-shot entity consistency, using a fidelity gate to ensure only accurate appearances count toward consistency scores. Experiments demonstrate that consistency in current video models drops sharply as the gap between a character's appearances increases. The authors then present EntityMem, which stores verified per-entity visual references in a persistent bank before generation and achieves the strongest character fidelity and presence of any method tested. A sympathetic reader would care because coherent multi-shot narratives require reliable entity identity over time, yet no prior standardized test has isolated this failure mode at scale.

Core claim

EntityBench supplies explicit entity schedules across 140 episodes with up to 13 recurring characters, 8 locations, and 22 objects per episode, while EntityMem stores verified per-entity visual references in a memory bank and produces the highest cross-shot character fidelity (Cohen's d = +2.33) and presence among evaluated systems; existing methods show sharp degradation in consistency as recurrence distance grows.

What carries the argument

Persistent per-entity memory bank that stores verified visual references before generation begins and retrieves them for each subsequent shot.

If this is right

  • Cross-shot entity consistency in existing video models falls sharply as the number of shots between appearances increases.
  • Storing verified per-entity references in a memory bank produces the largest measured gains in character fidelity and presence.
  • The three-pillar evaluation separates intra-shot quality, prompt following, and cross-shot consistency so each can be measured independently.
  • Benchmarks that track multiple entity types simultaneously across up to 50 shots expose failure modes missed by simpler prompt sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Video generation pipelines could adopt similar memory banks as a default module to support longer coherent stories.
  • The same entity-schedule format could be reused to test consistency in image-to-video or text-to-3D pipelines.
  • Recurrence-distance curves may become a standard diagnostic plot for any multi-shot generation system.
  • If the fidelity gate proves robust, it could be applied to filter training data for future models.

Load-bearing premise

Entity schedules extracted from real narrative media together with the fidelity gate used for scoring accurately capture the consistency problems that current video models actually face.

What would settle it

A new generation method that achieves equal or higher character fidelity and presence scores on the 140 EntityBench episodes without any per-entity memory bank would refute the claim that explicit memory is required.

Figures

Figures reproduced from arXiv: 2605.15199 by Meng Wei, Ruozhen He, Vicente Ordonez, Ziyan Yang.

Figure 1
Figure 1. Figure 1: Overview of the EntityBench evaluation suite. Three pillars progressively assess whether [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative comparison on a representative episode. Multiple characters recur in shots 1, 3, 4, 7, 8. EntityMem preserves all four characters identity, while changing locations according to the prompt. rather than being averaged into a shared per-shot context. When a shot needs to depict a character, the model conditions on a tight, per-entity description that survives across shots without being diluted by… view at source ↗
Figure 3
Figure 3. Figure 3: Per-episode entity counts (declared in the registry), broken down by entity type. [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Per-shot entity-load distributions, broken down by type. Location counts cluster tightly at [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Complementary CDF of per-entity maximum reappearance gap, stratified by tier. The [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Continuation-chain length distribution (number of consecutive shots between two cuts). [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Per-entity persistence statistics. Left: number of shots an entity appears in (median 2; right tail extends past 25 appearances). Right: longest consecutive-shot run an entity sustains (median 1; the right tail corresponds to anchor entities across multi-shot continuation segments). 0 10 20 30 40 50 Global shot index within episode 0 1 2 3 4 5 Avg # new entities introduced Where in the episode are entities… view at source ↗
Figure 8
Figure 8. Figure 8: Average number of new entities introduced at each shot index (left axis, blue), with the [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: ENTITYBENCH Example 1: story overview and entity registry. The header reports the structural counts (scenes, shots, characters, locations, objects). The registry below, with chip color indicating entity type, is at the bottom. shot followed by a six-shot continuation chain in The Scholar’s Study, a one-shot interlude in The Quiet Room, and a final three-shot continuation back in the study [PITH_FULL_IMAGE… view at source ↗
Figure 10
Figure 10. Figure 10: ENTITYBENCH Example 1: entity-persistence strip. Rows are entities in registry order and columns are shots in story order. A filled cell means the entity is scheduled in that shot. Solid vertical rules separate scenes; dashed rules mark within-scene hard cuts. E AGENT PROMPTS This section provides the full text of every prompt used by the four EntityMem agents, including Classification, Portrait, Verifica… view at source ↗
Figure 11
Figure 11. Figure 11: ENTITYBENCH Example 1: shot timeline. Each row is one shot. The verbatim action_descriptions with every entity that the shot’s entity_schedule references, bolded and tinted in its type color. Hard cuts are flagged with bold shot indices and a tinted row background. Best-candidate selection. After the image generator (Labs, 2024) produces N=5 candidates per entity and SAM2 (Ravi et al., 2024) segments each… view at source ↗
Figure 12
Figure 12. Figure 12: ENTITYBENCH Example 2: story overview and entity registry. E.3 VERIFICATION AGENT After selection, the Verification Agent inspects the chosen segmented portrait for the failure modes that defeat downstream compositing: missing body regions, see-through clothing, etc. A failed verification triggers a retry with an alternative chroma-key color, addressing the common case where a part of the foreground match… view at source ↗
Figure 13
Figure 13. Figure 13: ENTITYBENCH Example 2: entity-persistence strip. a structured plan of one or more keyframes, each with the participating entities, their positions on a discrete 7-cell horizontal grid, and the camera angle (front/left/right) to use as background. The prompt explicitly walks the agent through camera-pan reasoning so that characters retained across a continuation translate the correct way as the camera move… view at source ↗
Figure 14
Figure 14. Figure 14: ENTITYBENCH Example 2: shot timeline, part 1 of 2. Each row is one shot; the verbatim action_descriptions text appears with every entity that the shot’s entity_schedule references bolded and tinted in its type color. Hard cuts are flagged with bold shot indices and a tinted row background. F.2 PER-METHOD COVERAGE AND RAW MEANS The fidelity-gate-corrected means in the main paper ( [PITH_FULL_IMAGE:figures… view at source ↗
Figure 15
Figure 15. Figure 15: ENTITYBENCH Example 2: shot timeline, part 2 of 2 (continuation of [PITH_FULL_IMAGE:figures/full_fig_p037_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Prompt used by the Classification Agent to decide whether an object entity requires a [PITH_FULL_IMAGE:figures/full_fig_p038_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Prompt used by the Portrait Agent to write a character-specific prompt. The first [PITH_FULL_IMAGE:figures/full_fig_p039_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Prompt used by the Portrait Agent for objects that the Classification Agent flagged as [PITH_FULL_IMAGE:figures/full_fig_p040_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Prompt used by the Portrait Agent to write a panoramic-shot image generation prompt for [PITH_FULL_IMAGE:figures/full_fig_p040_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Vision-language prompt used by the Portrait Agent to select the best of [PITH_FULL_IMAGE:figures/full_fig_p041_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Prompt used by the Verification Agent to gate portraits before they enter the memory bank. [PITH_FULL_IMAGE:figures/full_fig_p041_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Layout Agent prompt, part 1 of 2: input fields and the global task rules. [PITH_FULL_IMAGE:figures/full_fig_p042_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Layout Agent prompt, part 2 of 2: continuation-shot reasoning, hard-cut defaults, the [PITH_FULL_IMAGE:figures/full_fig_p043_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: DINOv2 similarity measures consistency in a different way from LLM. [PITH_FULL_IMAGE:figures/full_fig_p044_24.png] view at source ↗
read the original abstract

Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, featuring explicit per-shot entity schedules for characters, objects, and locations across easy/medium/hard tiers with recurrence gaps up to 48 shots. It defines a three-pillar evaluation (intra-shot quality, prompt alignment, cross-shot consistency) that incorporates a fidelity gate to filter entity appearances before scoring consistency. As a baseline, the authors propose EntityMem, which maintains a persistent memory bank of verified per-entity visual references, and report that existing methods show sharp consistency degradation with recurrence distance while EntityMem achieves the highest character fidelity (Cohen's d = +2.33) and presence.

Significance. If the benchmark construction and fidelity gate are shown to be robust, the work would supply a standardized, entity-rich evaluation resource for long-range multi-shot video generation that improves on prior prompt-only or short-sequence tests. The empirical demonstration that explicit per-entity memory outperforms recurrence-based methods on character fidelity would provide a concrete engineering direction for narrative video systems, with the released code and data increasing its immediate utility.

major comments (3)
  1. [§3.2] §3.2 (fidelity gate): The gate is described as admitting only accurate entity appearances into cross-shot scoring, yet no implementation details (embedding threshold, reference selection, or human judgment protocol), sensitivity analysis on the threshold, or inter-annotator agreement are reported. Because the degradation curves and Cohen's d = +2.33 result are computed exclusively on gate-passing entities, this omission directly undermines the central empirical claims.
  2. [§3.1] §3.1 (entity schedule extraction): The process of deriving per-shot schedules and assigning easy/medium/hard tiers from real narrative media is outlined but lacks validation metrics such as inter-annotator agreement or agreement with model failure modes. The recurrence-distance degradation result and tier-wise comparisons rest on these schedules accurately reflecting the consistency problem; without such checks the benchmark's external validity is unclear.
  3. [Experiments / Table 3] Experiments section / Table 3: Baseline comparisons do not report model sizes, training data volumes, or hyperparameter controls for the evaluated generators. Without these controls it is impossible to isolate whether the reported +2.33 Cohen's d advantage is attributable to the EntityMem memory bank or to differences in underlying model capacity.
minor comments (3)
  1. [Figure 2] Figure 2: The memory-bank diagram would be clearer with an explicit arrow or caption indicating the verification step before storage.
  2. [§4] §4: The recurrence-distance metric is used throughout but never given an explicit equation; adding one would remove ambiguity when comparing to prior work.
  3. [References] References: Several recent multi-shot video papers on identity preservation are absent; adding them would strengthen the related-work positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where the manuscript requires additional details or analysis, we will revise accordingly to strengthen the presentation of EntityBench and EntityMem.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (fidelity gate): The gate is described as admitting only accurate entity appearances into cross-shot scoring, yet no implementation details (embedding threshold, reference selection, or human judgment protocol), sensitivity analysis on the threshold, or inter-annotator agreement are reported. Because the degradation curves and Cohen's d = +2.33 result are computed exclusively on gate-passing entities, this omission directly undermines the central empirical claims.

    Authors: We agree that the current description of the fidelity gate lacks sufficient implementation details. In the revised manuscript we will expand §3.2 to specify the embedding model and exact threshold used for verification, the protocol for selecting reference images from the memory bank, and the human judgment protocol. We will also report inter-annotator agreement for the verification step and include a sensitivity analysis showing how Cohen's d and degradation curves vary with threshold choice. These additions will directly support the robustness of the reported results. revision: yes

  2. Referee: [§3.1] §3.1 (entity schedule extraction): The process of deriving per-shot schedules and assigning easy/medium/hard tiers from real narrative media is outlined but lacks validation metrics such as inter-annotator agreement or agreement with model failure modes. The recurrence-distance degradation result and tier-wise comparisons rest on these schedules accurately reflecting the consistency problem; without such checks the benchmark's external validity is unclear.

    Authors: We acknowledge the value of explicit validation metrics. In the revision we will add inter-annotator agreement statistics for both the per-shot entity schedule derivation and the easy/medium/hard tier assignment. We will further include a short analysis comparing the defined tiers against observed failure modes of the evaluated models. These metrics will help confirm that the schedules accurately capture the long-range consistency challenge and thereby support the recurrence-distance and tier-wise findings. revision: yes

  3. Referee: Experiments section / Table 3: Baseline comparisons do not report model sizes, training data volumes, or hyperparameter controls for the evaluated generators. Without these controls it is impossible to isolate whether the reported +2.33 Cohen's d advantage is attributable to the EntityMem memory bank or to differences in underlying model capacity.

    Authors: The generators evaluated are off-the-shelf models from prior publications; we used their publicly released implementations without retraining. In the revised Experiments section we will add a table listing parameter counts, training-data descriptions, and the hyperparameter settings employed during our runs. Because EntityMem is applied as a plug-in memory augmentation on top of each base generator, our primary comparisons hold the underlying model fixed and vary only the presence of the memory bank. While exhaustive capacity-matched retraining is outside the scope of this benchmark paper, the added details will allow readers to assess potential capacity confounds. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark with external data and explicit engineering choices

full rationale

The paper constructs EntityBench from real narrative media with per-shot entity schedules and introduces a fidelity gate as part of its three-pillar evaluation. It then evaluates an explicit baseline (EntityMem) against other methods on this benchmark. No equations, derivations, or predictions are present that reduce by construction to fitted inputs, self-citations, or ansatzes. The central results are comparative empirical measurements on held-out data, with no load-bearing step that renames or re-derives its own inputs. This is a standard benchmark paper whose claims rest on the external validity of the media-derived schedules rather than internal self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that the extracted entity schedules from real media form a valid test distribution and that the fidelity gate correctly isolates appearance accuracy; no free parameters are described in the abstract.

axioms (1)
  • domain assumption Entity schedules derived from real narrative media capture representative consistency challenges for multi-shot video generation
    Invoked when constructing the 140 episodes and tiering them into easy/medium/hard.
invented entities (1)
  • EntityMem persistent memory bank no independent evidence
    purpose: Stores verified per-entity visual references for use during generation of later shots
    New component introduced as the baseline method; no independent evidence outside the paper's experiments is provided.

pith-pipeline@v0.9.0 · 5546 in / 1370 out tokens · 52546 ms · 2026-05-15T03:10:57.117220+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 8 internal anchors

  1. [1]

    Mixture of contexts for long video generation

    Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, et al. Mixture of contexts for long video generation. arXiv preprint arXiv:2508.21058,

  2. [2]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074,

  3. [3]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261,

  4. [4]

    Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation.arXiv preprint arXiv:2507.11245,

    Xiaokun Feng, Haiming Yu, Meiqi Wu, Shiyu Hu, Jintao Chen, Chen Zhu, Jiahong Wu, Xiangxiang Chu, and Kaiqi Huang. Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation.arXiv preprint arXiv:2507.11245,

  5. [5]

    Longvie: Multimodal-guided controllable ultra-long video generation

    Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, and Ziwei Liu. Longvie: Multimodal-guided controllable ultra-long video generation.arXiv preprint arXiv:2508.03694,

  6. [6]

    Id- animator: Zero-shot identity-preserving human video generation.arXiv preprint arXiv:2404.15275,

    Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, and Jie Zhang. Id- animator: Zero-shot identity-preserving human video generation.arXiv preprint arXiv:2404.15275,

  7. [7]

    Filmaster: Bridging cinematic principles and generative ai for automated film generation.arXiv preprint arXiv:2506.18899, 2025a

    Kaiyi Huang, Yukun Huang, Xintao Wang, Zinan Lin, Xuefei Ning, Pengfei Wan, Di Zhang, Yu Wang, and Xihui Liu. Filmaster: Bridging cinematic principles and generative ai for automated film generation.arXiv preprint arXiv:2506.18899, 2025a. Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In...

  8. [8]

    Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091,

    Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091,

  9. [9]

    Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025a

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025a. Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment. In Proceed...

  10. [10]

    Shotstream: Streaming multi-shot video generation for interactive storytelling.arXiv preprint arXiv:2603.25746,

    Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, and Tianfan Xue. Shotstream: Streaming multi-shot video generation for interactive storytelling.arXiv preprint arXiv:2603.25746,

  11. [11]

    Identity- grpo: Optimizing multi-human identity-preserving video generation via reinforcement learning

    Xiangyu Meng, Zixian Zhang, Zhenghao Zhang, Junchao Liao, Long Qin, and Weizhi Wang. Identity- grpo: Optimizing multi-human identity-preserving video generation via reinforcement learning. arXiv preprint arXiv:2510.14256, 2025a. Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, et al. H...

  12. [12]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714,

  13. [13]

    Msvbench: Towards human-level evaluation of multi-shot video generation.arXiv preprint arXiv:2602.23969,

    Haoyuan Shi, Yunxin Li, Nanhao Deng, Zhenran Xu, Xinyu Chen, Longyue Wang, Baotian Hu, and Min Zhang. Msvbench: Towards human-level evaluation of multi-shot video generation.arXiv preprint arXiv:2602.23969,

  14. [14]

    Storybooth: Training-free multi-subject consistency for improved visual storytelling.arXiv preprint arXiv:2504.05800,

    Jaskirat Singh, Junshen Kevin Chen, Jonas Kohler, and Michael Cohen. Storybooth: Training-free multi-subject consistency for improved visual storytelling.arXiv preprint arXiv:2504.05800,

  15. [15]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314,

  16. [16]

    Echoshot: Multi-shot portrait video generation

    Jiahao Wang, Hualian Sheng, Sijia Cai, Weizhan Zhang, Caixia Yan, Yachuang Feng, Bing Deng, and Jieping Ye. Echoshot: Multi-shot portrait video generation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025a. Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, and Xu ...

  17. [17]

    Moviebench: A hierarchical movie level dataset for long video generation

    Weijia Wu, Mingyu Liu, Zeyu Zhu, Xi Xia, Haoen Feng, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, and Mike Zheng Shou. Moviebench: A hierarchical movie level dataset for long video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pp. 28984–28994, 2025a. Xiaoxue Wu, Bingjie Gao, Yu Qiao, Yaohui Wang, and Xinyuan Chen. Cin...

  18. [18]

    Dreamfactory: Pioneering multi-scene long video generation with a multi-agent framework.arXiv preprint arXiv:2408.11788,

    Zhifei Xie, Daniel Tang, Dingwei Tan, Jacques Klein, Tegawend F Bissyand, and Saad Ezzini. Dreamfactory: Pioneering multi-scene long video generation with a multi-agent framework.arXiv preprint arXiv:2408.11788,

  19. [19]

    LongLive: Real-time Interactive Long Video Generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622,

  20. [20]

    Shotverse: Advancing cinematic camera control for text-driven multi-shot video creation.arXiv preprint arXiv:2603.11421,

    Songlin Yang, Zhe Wang, Xuyi Yang, Songchun Zhang, Xianghao Kong, Taiyi Wu, Xiaotong Zhao, Ran Zhang, Alan Zhao, and Anyi Rao. Shotverse: Advancing cinematic camera control for text-driven multi-shot video creation.arXiv preprint arXiv:2603.11421,

  21. [21]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072,

  22. [22]

    Infinity- rope: Action-controllable infinite video generation emerges from autoregressive self-rollout.arXiv preprint arXiv:2511.20649,

    Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, and Pinar Yanardag. Infinity- rope: Action-controllable infinite video generation emerges from autoregressive self-rollout.arXiv preprint arXiv:2511.20649,

  23. [23]

    Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation

    Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. arXiv preprint arXiv:2505.20292, 2025a. Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-vi...

  24. [24]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755,

  25. [25]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404,

  26. [26]

    Concat-id: Towards universal identity-preserving video synthesis

    Yong Zhong, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, and Chongxuan Li. Concat-id: Towards universal identity-preserving video synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1906–1915,

  27. [27]

    Videomemory: Toward consistent video generation via memory integration.arXiv preprint arXiv:2601.03655,

    Jinsong Zhou, Yihua Du, Xinli Xu, Luozhou Wang, Zijie Zhuang, Yehang Zhang, Shuaibo Li, Xiaojun Hu, Bolan Su, and Ying-cong Chen. Videomemory: Toward consistent video generation via memory integration.arXiv preprint arXiv:2601.03655,

  28. [28]

    2c+1o” denotes “≥2 characters and ≥1 object,

    plus a fixed-length stress test, measuring both typical-case behavior (easy/medium) and worst-case scaling (hard) within a tractable compute budget. 14 Preprint. 2 4 6 8 10 12 Characters / episode 0 5 10 15 20 25 30# episodes median=7.0 10 20 30 40 50 Objects / episode 0 5 10 15 20 25 30# episodes median=13.0 2 4 6 8 10 12 14 Locations / episode 0 10 20 3...

  29. [29]

    image and text embeddings respec- tively (jointly trained, 512-dim, unit-normalized). For an image x and text t, the CLIP text-image similarity is CLIPsim(x, t) =ϕ img CLIP(x)⊤ϕtxt CLIP(t)∈[−1,1].(1) Grounding.Let G denote the GroundingDINO(Liu et al., 2024a) detector with text encoder bert-base-uncased. For frame f and query q, G(f, q) returns a (possibl...

  30. [30]

    did the action use the object correctly

    optical flow is used to interpolate intermediate frames; MS(Sk) is the mean reconstruction quality of the interpolation, with higher values indicating smoother apparent motion. Implementation follows Huang et al. (2024b). 21 Preprint. Dynamic degree(range [0,1] ): the fraction of inter-frame pairs whose RAFT optical flow magni- tude exceeds a threshold; p...

  31. [31]

    This appendix decomposes the corrected means into their two components for transparency

    and Appendix F.1 aggregate asm= rawmean(m)×coverage(m) , where coverage is the fraction of eligible (shot, entity) instances that pass the fidelity gate (Equation 22). This appendix decomposes the corrected means into their two components for transparency. What coverage measures.For each per-entity metric, coverage answers a different question: For Pillar...

  32. [32]

    right" with camera_angle=

    Place new characters on the side the camera panned toward Example: Previous shot had CharA at "right" with camera_angle="front". New shot introduces CharB and CharC. The camera should pan right to make room → camera_angle="right". CharA moves to "left" or "center-left". CharB and CharC enter at "center-right" and "right". If the retained character was at ...

  33. [33]

    47 Preprint

    shows little gap effect for any method, as DINOv2 cosine similarity reflects identity differently, discussed in Appendix F.4. 47 Preprint. Table 20:DINOv2 face similarity by gap distance.DINOv2 cosine similarity (mean of adjacent- pair sims to centroid) shows little gap effect across methods, consistent with embedding-similarity rewarding visual self-simi...

  34. [34]

    and universal identity-preserving synthesis (Zhong et al., 2025). However, these methods focus primarily on human facial identity for one or two subjects, leaving broader entity types, such as objects, locations, and character ensembles, largely unaddressed. LLM-Directed Video Generation.LLMs have been used as video planners to produce scene descriptions ...