arxiv: 2510.20206 · v2 · pith:727573YJ · submitted 2025-10-23 · 💻 cs.CV

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

Pith reviewed 2026-05-18 05:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords: prompt optimization, text-to-video generation, diffusion models, retrieval augmentation, test-time scaling, LLM fine-tuning, semantic alignment, temporal coherence
The pith

Cross-stage prompt optimization substantially improves semantic alignment, composition, and temporal stability in text-to-video generation across multiple models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a unified cross-stage prompt optimization framework can substantially improve text-to-video generation by aligning user prompts with the model's training data and refining them iteratively at test time. If this holds, existing diffusion models could produce videos with better semantic accuracy, object composition, motion consistency, and physical realism through smarter prompting alone. The approach combines retrieval from a relation graph, closed-loop feedback on multiple quality signals, and LLM fine-tuning that internalizes the optimizations. A general reader would care because prompt design is accessible and cheap compared with retraining large models, which could make high-quality video generation more widely available. The paper reports experiments across five T2V models and five benchmarks that show large gains over existing prompt optimization methods.

Core claim

RAPO++ unifies three stages:

  • Retrieval-augmented prompt optimization (RAPO), which enriches prompts with modifiers from a relation graph and refactors them to match training distributions.
  • Closed-loop sample-specific prompt optimization (SSPO), which iteratively refines prompts using multi-source feedback on semantic alignment, spatial fidelity, temporal coherence, and optical flow.
  • Fine-tuning of the rewriter LLM on optimized prompt pairs, which internalizes task-specific patterns for efficient generation.

What carries the argument

The three-stage cross-stage prompt optimization process that performs data-aligned refinement, feedback-driven iterative scaling, and LLM internalization to improve outputs without altering the generative backbone.
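
The closed-loop stage can be pictured as a small search loop. The sketch below is a minimal illustration, not the authors' implementation: `generate`, `score`, and `rewrite` are hypothetical callables standing in for the T2V model, the multi-source scorers, and the rewriter LLM, and the weighted average used as a composite score is an assumption, since the abstract does not say how the signals are combined.

```python
from typing import Callable, Dict, List, Tuple

# Signal names follow the four feedback sources named in the core claim.
SIGNALS = ("semantic_alignment", "spatial_fidelity", "temporal_coherence", "optical_flow")


def composite(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-signal scores (assumed combination rule)."""
    total = sum(weights[name] for name in SIGNALS)
    return sum(weights[name] * scores[name] for name in SIGNALS) / total


def refine_prompt(
    prompt: str,
    generate: Callable[[str], object],
    score: Callable[[object, str], Dict[str, float]],
    rewrite: Callable[[str, Dict[str, float]], str],
    weights: Dict[str, float],
    max_iters: int = 3,
) -> Tuple[str, float, List[Tuple[str, str]]]:
    """Iteratively rewrite a prompt, keeping the best-scoring candidate.

    Returns the best prompt, its composite score, and the (before, after)
    pairs of accepted rewrites, which a later fine-tuning stage could reuse.
    """
    best_prompt = prompt
    best_scores = score(generate(prompt), prompt)
    best_value = composite(best_scores, weights)
    pairs: List[Tuple[str, str]] = []
    for _ in range(max_iters):
        candidate = rewrite(best_prompt, best_scores)
        # Always score against the original user prompt, not the rewrite.
        cand_scores = score(generate(candidate), prompt)
        cand_value = composite(cand_scores, weights)
        if cand_value > best_value:  # accept only strict improvements
            pairs.append((best_prompt, candidate))
            best_prompt, best_scores, best_value = candidate, cand_scores, cand_value
    return best_prompt, best_value, pairs
```

Accepting only strict improvements keeps the loop monotone in the composite score, which also means the loop inherits whatever bias the scorers carry.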

If this is right

  • Generated videos show improved handling of multiple objects and complex scene compositions.
  • Temporal stability increases with fewer motion artifacts and better frame-to-frame consistency.
  • Physical plausibility rises as depicted actions and object interactions become more realistic.
  • Gains appear consistently when the method is applied to different underlying text-to-video models.
  • After fine-tuning, the rewriter LLM produces high-quality prompts efficiently at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The staged approach could extend to text-to-image or text-to-3D tasks where similar prompt-model mismatches limit output quality.
  • Adding user preference signals to the feedback loop might enable more personalized video outputs over time.
  • Testing the method on longer video sequences could show whether coherence holds beyond short clips.

Load-bearing premise

The multi-source feedback signals in the closed-loop stage provide reliable guidance that consistently improves generation quality rather than introducing new artifacts or biases.

What would settle it

Applying the full pipeline to a held-out text-to-video model and benchmark and finding no gains or actual drops in metrics for semantic alignment and temporal coherence would falsify the central claim.

Figures

Figures reproduced from arXiv: 2510.20206 by Bingjie Gao, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Li Niu, Qianli Ma, Qingyang Liu, Shuai Yang, Xiaoxue Wu, Xinyuan Chen, Yaohui Wang, Yu Qiao.

Figure 1: Overview of RAPO++. The framework couples training-data–aligned prompt refinement with test-time scaling to enhance Text-to-Video (T2V) generation without altering the generative backbone. Stage 1 RAPO: Retrieval-Augmented Prompt Optimization (Sec. 3). User prompts are augmented via a retrieval-based relation graph and refactored by a fine-tuned LLM, while a frozen LLM provides alternative rewrites. A disc…
Figure 2: Generation results under different iterations of prompt refinement at inference utilizing SSPO. The initial prompt is “valkyrie riding flying horses through the clouds”. As the number of iterations increases (from left to right), the generated video becomes more detailed and vivid, and more consistent with the user’s intent.
Figure 3: Construction of the relation graph. The relation graph consists of multiple nodes (scenes acting as core nodes, with modifiers connected as sub-nodes). For each prompt in the database, an LLM extracts the scene and related modifiers. Depending on whether the extracted scene is already in the graph, different methods are used to incorporate the new information into the graph.
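
The graph structure in this caption can be sketched with a plain dictionary. This is a minimal illustration under stated assumptions: `extract` is a hypothetical stand-in for the LLM extraction step, scenes are matched by exact normalized string, and an existing scene simply merges its new modifiers. The caption says different methods handle new and existing scenes without naming them here, so the merge rule below is an assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple


def build_relation_graph(
    prompts: Iterable[str],
    extract: Callable[[str], Tuple[str, List[str]]],
) -> Dict[str, Set[str]]:
    """Map each scene (core node) to its set of modifiers (sub-nodes)."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for prompt in prompts:
        scene, modifiers = extract(prompt)  # hypothetical LLM call
        key = scene.strip().lower()
        # New scene: defaultdict creates the core node on first access.
        # Existing scene: modifiers are merged into it (assumed rule).
        graph[key].update(m.strip().lower() for m in modifiers)
    return dict(graph)


def retrieve_modifiers(graph: Dict[str, Set[str]], scene: str, k: int = 3) -> List[str]:
    """Return up to k modifiers for a scene, sorted for determinism."""
    return sorted(graph.get(scene.strip().lower(), set()))[:k]
```

A real retriever would presumably rank modifiers by semantic relevance to the user prompt; alphabetical order is used here only to keep the sketch deterministic.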
Figure 4: Qualitative comparisons across dynamic and static dimensions. This figure showcases videos generated using LaVie with short prompts, GPT-4 and Open-Sora prompt optimizations, and our RAPO method. Videos produced with RAPO exhibit significantly sharper spatial details, smoother temporal transitions, and a closer semantic alignment with the input text.
Figure 5: Qualitative comparisons using LaVie with initial prompts (left) and optimized prompts from RAPO++ (right). We present qualitative comparisons from the dynamic and static dimensions. The videos generated by RAPO++ exhibit sharper details, smoother temporal transitions, and better alignment with the input text.
Figure 7: Visualization of attention maps for multiple objects from different prompts. Adding a description of the relative spatial position between objects can improve multi-object generation.
Figure 6: Qualitative examples illustrating the limitation of RAPO++ in numeracy-related compositional tasks. Given prompts “Five colorful parrots perch on a tree branch” (left) and “Three majestic giraffes graze on the leaves of tall trees in the African savannah, their long necks reaching high, Salvador Dali style” (right), the generated frames fail to accurately match the specified object counts, highlighting per…
Figure 8: Prompt length distribution comparison among various methods. The distribution of RAPO-optimized prompts is closer to that of the training prompts.
Figure 9: A complex unusual example (a panda bear in a red apron and name tag works as a cashier in a Chinese New Year-themed supermarket) generated from the initial prompt (left) or the optimized prompt (right). The video generated from the optimized prompt is more consistent with the initial prompt and the user's intention.
Figure 10: Inference-time scaling performance tested on temporal consistency, visual quality, T2V alignment, and factual consistency. We conduct experiments using LaVie [27] and utilize 2.2k T2V prompts provided in [7]. Each metric exhibits a consistent upward trajectory as iteration count increases, underscoring the effectiveness of RAPO++ in enhancing generative performance.
Original abstract

Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback (including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow), yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents RAPO++, a three-stage cross-stage prompt optimization framework for text-to-video (T2V) generation. Stage 1 (RAPO) retrieves semantically relevant modifiers from a relation graph to enrich and refactor user prompts for better alignment with training data distributions. Stage 2 (SSPO) performs closed-loop iterative refinement of prompts using multi-source feedback signals (semantic alignment, spatial fidelity, temporal coherence, and optical flow). Stage 3 fine-tunes the rewriter LLM on pairs of original and SSPO-optimized prompts. The central empirical claim is that RAPO++ yields significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming prior methods by large margins across five T2V models and five benchmarks, while remaining model-agnostic and without modifying the generative backbone. Code is released at https://github.com/Vchitect/RAPO.

Significance. If the empirical results hold after addressing validation concerns, this would constitute a meaningful contribution to prompt engineering for generative video models by providing a scalable, training-free (at inference) optimization pipeline that combines retrieval, test-time iteration, and distillation. The public code release supports reproducibility. The approach is noteworthy for its model-agnostic nature, but its significance depends on demonstrating that the SSPO feedback loop produces genuine quality improvements rather than metric-specific artifacts.

major comments (2)
  1. [§3.2] §3.2 (SSPO closed-loop description): The composite reward formed from semantic alignment, spatial fidelity, temporal coherence, and optical flow is presented as providing reliable guidance for iterative prompt refinement, yet no calibration study, human correlation analysis, or bias audit is reported. Because the scoring models are themselves imperfect on long-horizon video properties, the loop risks converging on prompts that exploit scorer artifacts rather than improving human-perceived quality. This directly underpins the large-margin gains attributed to Stage 2 and must be addressed with concrete evidence (e.g., human preference studies or failure-case analysis).
  2. [§5] §5 (Experiments): The headline claim of large-margin outperformance across five T2V models and five benchmarks is stated without accompanying quantitative tables, per-stage ablations, or controls that isolate SSPO's contribution from RAPO and LLM fine-tuning. Detailed metrics (e.g., specific CLIP, temporal consistency, or human eval scores) and statistical significance tests are required to substantiate the central empirical assertion.
minor comments (2)
  1. [§3.1] The construction and data sources of the 'relation graph' in Stage 1 are only briefly mentioned; a short appendix or paragraph detailing its creation would improve reproducibility.
  2. [§5] Figure captions and axis labels in the experimental section could be expanded to explicitly state which metric corresponds to which feedback signal used in SSPO.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the validation of the SSPO stage and the overall empirical claims.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (SSPO closed-loop description): The composite reward formed from semantic alignment, spatial fidelity, temporal coherence, and optical flow is presented as providing reliable guidance for iterative prompt refinement, yet no calibration study, human correlation analysis, or bias audit is reported. Because the scoring models are themselves imperfect on long-horizon video properties, the loop risks converging on prompts that exploit scorer artifacts rather than improving human-perceived quality. This directly underpins the large-margin gains attributed to Stage 2 and must be addressed with concrete evidence (e.g., human preference studies or failure-case analysis).

    Authors: We agree that explicit validation of the composite reward against human judgments is necessary to rule out potential exploitation of scorer artifacts. The individual feedback signals draw from established video evaluation practices, but the current manuscript does not report a dedicated calibration or human correlation study. In the revised version we will add a human preference study with multiple annotators comparing videos from original, RAPO, and SSPO prompts, together with a failure-case analysis that examines cases where the loop improves or fails to improve perceived quality. These additions will directly address the concern. revision: yes

  2. Referee: [§5] §5 (Experiments): The headline claim of large-margin outperformance across five T2V models and five benchmarks is stated without accompanying quantitative tables, per-stage ablations, or controls that isolate SSPO's contribution from RAPO and LLM fine-tuning. Detailed metrics (e.g., specific CLIP, temporal consistency, or human eval scores) and statistical significance tests are required to substantiate the central empirical assertion.

    Authors: Section 5 already contains quantitative tables comparing RAPO++ against baselines across the five models and benchmarks. To better isolate SSPO's contribution we will add explicit per-stage ablation tables and controls in the revision. We will also expand the reported metrics with specific CLIP, temporal consistency, and human evaluation scores, and include statistical significance tests (e.g., paired t-tests) to support the claimed margins. revision: yes
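
The paired t-test mentioned in the second response compares per-prompt scores from two prompting conditions evaluated on the same prompts. A minimal sketch using only the standard library follows; the score lists are made-up placeholders for illustration, not results from the paper, and `paired_t_statistic` is a hypothetical helper name.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples a and b.

    t = mean(d) / (stdev(d) / sqrt(n)), where d[i] = a[i] - b[i].
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-prompt metric values (illustrative only, not from the paper).
optimized = [0.62, 0.70, 0.58, 0.66, 0.73]
baseline = [0.55, 0.64, 0.57, 0.60, 0.65]
t, dof = paired_t_statistic(optimized, baseline)
print(f"t = {t:.3f}, dof = {dof}")
```

The p-value then comes from the Student t distribution with `dof` degrees of freedom; a library routine such as SciPy's `scipy.stats.ttest_rel` performs both steps.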

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical prompt-optimization pipeline (RAPO retrieval, SSPO closed-loop feedback from auxiliary scorers, and LLM fine-tuning) whose performance claims rest on external benchmarks across five T2V models rather than any mathematical derivation. No equations, fitted parameters, or self-citations are shown to reduce the reported gains to inputs by construction. The SSPO loop employs independent multi-source signals whose correlation with quality is an empirical assumption, not a definitional tautology. The method is therefore self-contained against external evaluation and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework depends on the existence and utility of a relation graph for prompt enrichment and on the assumption that iterative feedback from generated videos can be used to reliably improve prompts; specific numerical hyperparameters for retrieval and iteration counts are not detailed in the abstract.

free parameters (1)
  • retrieval and iteration hyperparameters
    Parameters controlling how many modifiers are retrieved and how many refinement iterations are performed are likely tuned on validation data.
axioms (1)
  • domain assumption: User-provided prompts are typically short, unstructured, and misaligned with the model's training distribution
    This premise is stated directly in the abstract as the motivation for Stage 1.
invented entities (1)
  • relation graph (no independent evidence)
    purpose: To retrieve semantically relevant modifiers for prompt enrichment
    Introduced as a core component of the RAPO stage; no independent evidence of its construction or coverage is provided in the abstract.

pith-pipeline@v0.9.0 · 5890 in / 1377 out tokens · 37340 ms · 2026-05-18T05:09:49.935993+00:00 · methodology


Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · 21 internal anchors

  1. [1]

    Wan: Open and Advanced Large-Scale Video Generative Models

    T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025

  2. [2]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang et al., “Hunyuanvideo: A systematic framework for large video generative models,” arXiv preprint arXiv:2412.03603, 2024

  3. [3]

    A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

    Q. Zhang, F. Lyu, Z. Sun, L. Wang, W. Zhang, W. Hua, H. Wu, Z. Guo, Y. Wang, N. Muennighoff et al., “A survey on test-time scaling in large language models: What, how, where, and how well?” arXiv preprint arXiv:2503.24235, 2025

  4. [4]

    Hie-edt: Hierarchical interval estimation-based evidential decision tree,

    B. Gao, Q. Zhou, and Y. Deng, “Hie-edt: Hierarchical interval estimation-based evidential decision tree,” Pattern Recognition, vol. 146, p. 110040, 2024

  5. [5]

    s1: Simple test-time scaling

    N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto, “s1: Simple test-time scaling,” arXiv preprint arXiv:2501.19393, 2025

  6. [6]

    The devil is in the prompts: Retrieval-augmented prompt optimization for text-to-video generation,

    B. Gao, X. Gao, X. Wu, Y. Zhou, Y. Qiao, L. Niu, X. Chen, and Y. Wang, “The devil is in the prompts: Retrieval-augmented prompt optimization for text-to-video generation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3173–3183

  7. [7]

    Lift: Leveraging human feedback for text-to-video model alignment,

    Y. Wang, Z. Tan, J. Wang, X. Yang, C. Jin, and H. Li, “Lift: Leveraging human feedback for text-to-video model alignment,” arXiv preprint arXiv:2412.04814, 2024

  8. [8]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models

    Z. Huang, F. Zhang, X. Xu, Y. He, J. Yu, Z. Dong, Q. Ma, N. Chanpaisit, C. Si, Y. Jiang et al., “Vbench++: Comprehensive and versatile benchmark suite for video generative models,” arXiv preprint arXiv:2411.13503, 2024

  9. [9]

    Bim-afa: Belief information measure-based attribute fusion approach in improving the quality of uncertain data,

    B. Gao, Q. Zhou, and Y. Deng, “Bim-afa: Belief information measure-based attribute fusion approach in improving the quality of uncertain data,” Information Sciences, vol. 608, pp. 950–969, 2022

  10. [10]

    Mantisscore: Building automatic metrics to simulate fine-grained human feedback for video generation

    X. He, D. Jiang, G. Zhang, M. Ku, A. Soni, S. Siu, H. Chen, A. Chandra, Z. Jiang, A. Arulraj et al., “Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation,” arXiv preprint arXiv:2406.15252, 2024

  11. [11]

    LLaVA-OneVision: Easy Visual Task Transfer

    B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024

  12. [12]

    Optimizing prompts for text-to-image generation,

    Y. Hao, Z. Chi, L. Dong, and F. Wei, “Optimizing prompts for text-to-image generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 66923–66939, 2023

  13. [13]

    Inference-time alignment in diffusion models with reward-guided generation: Tutorial and review

    M. Uehara, Y. Zhao, C. Wang, X. Li, A. Regev, S. Levine, and T. Biancalani, “Inference-time alignment in diffusion models with reward-guided generation: Tutorial and review,” arXiv preprint arXiv:2501.09685, 2025

  14. [14]

    Remasking discrete diffusion models with inference-time scaling

    G. Wang, Y. Schiff, S. S. Sahoo, and V. Kuleshov, “Remasking discrete diffusion models with inference-time scaling,” arXiv preprint arXiv:2503.00307, 2025

  15. [15]

    Inference-time scaling of diffusion models through classical search,

    X. Zhang, H. Lin, H. Ye, J. Zou, J. Ma, Y. Liang, and Y. Du, “Inference-time scaling of diffusion models through classical search,” arXiv preprint arXiv:2505.23614, 2025

  16. [16]

    Inference-time scaling for diffusion models beyond scaling denoising steps

    N. Ma, S. Tong, H. Jia, H. Hu, Y.-C. Su, M. Zhang, X. Yang, Y. Li, T. Jaakkola, X. Jia et al., “Inference-time scaling for diffusion models beyond scaling denoising steps,” arXiv preprint arXiv:2501.09732, 2025

  17. [17]

    Decouple-then-merge: Finetune diffusion models as multi-task learning,

    Q. Ma, X. Ning, D. Liu, L. Niu, and L. Zhang, “Decouple-then-merge: Finetune diffusion models as multi-task learning,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 23281–23291

  18. [18]

    Videophy: Evaluating physical commonsense for video generation

    H. Bansal, Z. Lin, T. Xie, Z. Zong, M. Yarom, Y. Bitton, C. Jiang, Y. Sun, K.-W. Chang, and A. Grover, “Videophy: Evaluating physical commonsense for video generation,” arXiv preprint arXiv:2406.03520, 2024

  19. [19]

    Lightfair: Towards an efficient alternative for fair t2i diffusion via debiasing pre-trained text encoders,

    B. Han, Q. Xu, S. Bao, Z. Yang, K. Zi, and Q. Huang, “Lightfair: Towards an efficient alternative for fair t2i diffusion via debiasing pre-trained text encoders,” arXiv preprint arXiv:2509.23639, 2025

  20. [20]

    Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer

    E. Xie, J. Chen, Y. Zhao, J. Yu, L. Zhu, Y. Lin, Z. Zhang, M. Li, J. Chen, H. Cai et al., “Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer,” arXiv preprint arXiv:2501.18427, 2025

  21. [21]

    Inference-time text-to-video alignment with diffusion latent beam search,

    Y. Oshima, M. Suzuki, Y. Matsuo, and H. Furuta, “Inference-time text-to-video alignment with diffusion latent beam search,” arXiv preprint arXiv:2501.19252, 2025

  22. [22]

    Optimizing prompts for text-to-image generation,

    Y. Hao, Z. Chi, L. Dong, and F. Wei, “Optimizing prompts for text-to-image generation,” Advances in Neural Information Processing Systems, vol. 36, 2024

  23. [23]

    Qwen Technical Report

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

  24. [24]

    Tailored visions: Enhancing text-to-image generation with personalized prompt rewriting,

    Z. Chen, L. Zhang, F. Weng, L. Pan, and Z. Lan, “Tailored visions: Enhancing text-to-image generation with personalized prompt rewriting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7727–7736

  25. [25]

    Dynamic prompt optimizing for text-to-image generation,

    W. Mo, T. Zhang, Y. Bai, B. Su, J.-R. Wen, and Q. Yang, “Dynamic prompt optimizing for text-to-image generation,” in CVPR, 2024

  26. [26]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,” arXiv preprint arXiv:2408.06072, 2024

  27. [27]

    Lavie: High-quality video generation with cascaded latent diffusion models

    Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang et al., “Lavie: High-quality video generation with cascaded latent diffusion models,” arXiv preprint arXiv:2309.15103, 2023

  28. [28]

    Mistral 7B

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023

  29. [29]

    Reprompt: Reasoning-augmented reprompting for text-to-image generation via reinforcement learning,

    M. Wu, L. Wang, P. Zhao, F. Yang, J. Zhang, J. Liu, Y. Zhan, W. Han, H. Sun, J. Ji et al., “Reprompt: Reasoning-augmented reprompting for text-to-image generation via reinforcement learning,” arXiv preprint arXiv:2505.17540, 2025

  30. [30]

    Vbench: Comprehensive benchmark suite for video generative models,

    Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit et al., “Vbench: Comprehensive benchmark suite for video generative models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21807–21818

  31. [31]

    Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

    F. Meng, J. Liao, X. Tan, W. Shao, Q. Lu, K. Zhang, Y. Cheng, D. Li, Y. Qiao, and P. Luo, “Towards world simulator: Crafting physical commonsense-based benchmark for video generation,” arXiv preprint arXiv:2410.05363, 2024

  32. [32]

    Latte: Latent Diffusion Transformer for Video Generation

    X. Ma, Y. Wang, G. Jia, X. Chen, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao, “Latte: Latent diffusion transformer for video generation,” arXiv preprint arXiv:2401.03048, 2024

  33. [33]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

  34. [34]

    Denoising Diffusion Implicit Models

    J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020

  35. [35]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020

  36. [36]

    Improving image generation with better captions,

    J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo et al., “Improving image generation with better captions,” Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, vol. 2, no. 3, p. 8, 2023

  37. [37]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695

  38. [38]

    Promptenhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting,

    L. Wang, X. Xing, Y. Cheng, Z. Zhao, J. Tao, Q. Wang, R. Li, X. Li, M. Wu, X. Deng et al., “Promptenhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting,” arXiv preprint arXiv:2509.04545, 2025

  39. [39]

    Scaling rectified flow transformers for high-resolution image synthesis,

    P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first International Conference on Machine Learning, 2024

  40. [40]

    Adding conditional control to text-to-image diffusion models,

    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847

  41. [41]

    Collaborative diffusion for multi-modal face generation and editing,

    Z. Huang, K. C. Chan, Y. Jiang, and Z. Liu, “Collaborative diffusion for multi-modal face generation and editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6080–6090

  42. [42]

    T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback

    J. Li, W. Feng, T.-J. Fu, X. Wang, S. Basu, W. Chen, and W. Y. Wang, “T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback,” arXiv preprint arXiv:2405.18750, 2024

  43. [43]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models,

    H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan, “Videocrafter2: Overcoming data limitations for high-quality video diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7310–7320

  44. [44]

    A user-friendly framework for generating model-preferred prompts in text-to-image synthesis,

    N. Hei, Q. Guo, Z. Wang, Y. Wang, H. Wang, and W. Zhang, “A user-friendly framework for generating model-preferred prompts in text-to-image synthesis,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2139–2147

  45. [45]

    Prompt refinement with image pivot for text-to-image generation,

    J. Zhan, Q. Ai, Y. Liu, Y. Pan, T. Yao, J. Mao, S. Ma, and T. Mei, “Prompt refinement with image pivot for text-to-image generation,” arXiv preprint arXiv:2407.00247, 2024

  46. [46]

    Open-sora: Democratizing efficient video production for all,

    “Open-sora: Democratizing efficient video production for all,”

  47. [47]

    URL: https://github.com/hpcaitech/Open-Sora

  48. [48]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  49. [49]

    Vista: A test-time self-improving video generation agent,

    D. X. Long, X. Wan, H. Nakhost, C.-Y. Lee, T. Pfister, and S. Ö. Arık, “Vista: A test-time self-improving video generation agent,” arXiv preprint arXiv:2510.15831, 2025

  50. [50]

    Training-free structured diffusion guidance for compositional text-to-image synthesis

    W. Feng, X. He, T.-J. Fu, V. Jampani, A. Akula, P. Narayana, S. Basu, X. E. Wang, and W. Y. Wang, “Training-free structured diffusion guidance for compositional text-to-image synthesis,” arXiv preprint arXiv:2212.05032, 2022

  51. [51]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “SDXL: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952, 2023

  52. [52]

    Magnet: We never know how text-to-image diffusion models work, until we learn how vision-language models function,

    C. Zhuang, Y. Hu, and P. Gao, “Magnet: We never know how text-to-image diffusion models work, until we learn how vision-language models function,” arXiv preprint arXiv:2409.19967, 2024

  53. [53]

    Conform: Contrast is all you need for high-fidelity text-to-image diffusion models,

    T. H. S. Meral, E. Simsar, F. Tombari, and P. Yanardag, “Conform: Contrast is all you need for high-fidelity text-to-image diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9005–9014

  54. [54]

    Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models,

    H. Chefer, Y. Alaluf, Y. Vinker, L. Wolf, and D. Cohen-Or, “Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models,” ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–10, 2023

  55. [55]

    Grounded text-to-image synthesis with attention refocusing,

    Q. Phung, S. Ge, and J.-B. Huang, “Grounded text-to-image synthesis with attention refocusing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7932–7942

  56. [56]

    A cat is a cat (not a dog!): Unraveling information mix-ups in text-to-image encoders through causal analysis and embedding optimization,

    C.-Y. Chen, L.-W. Tsao, C. Tseng, and H.-H. Shuai, “A cat is a cat (not a dog!): Unraveling information mix-ups in text-to-image encoders through causal analysis and embedding optimization,” arXiv preprint arXiv:2410.00321, 2024

  57. [57]

    Evalcrafter: Benchmarking and evaluating large video generation models,

    Y. Liu, X. Cun, X. Liu, X. Wang, Y. Zhang, H. Chen, Y. Liu, T. Zeng, R. Chan, and Y. Shan, “Evalcrafter: Benchmarking and evaluating large video generation models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 139–22 149

  58. [58]

    T2v-compbench: A comprehensive benchmark for compositional text-to-video generation

    K. Sun, K. Huang, X. Liu, Y. Wu, Z. Xu, Z. Li, and X. Liu, “T2v-compbench: A comprehensive benchmark for compositional text-to-video generation,” arXiv preprint arXiv:2407.14505, 2024

  59. [59]

    Design guidelines for prompt engineering text-to-image generative models,

    V. Liu and L. B. Chilton, “Design guidelines for prompt engineering text-to-image generative models,” in Proceedings of the 2022 CHI conference on human factors in computing systems, 2022, pp. 1–23

  60. [60]

    What’s in a text-to-image prompt? the potential of stable diffusion in visual arts education,

    N. Dehouche and K. Dehouche, “What’s in a text-to-image prompt? the potential of stable diffusion in visual arts education,” Heliyon, vol. 9, no. 6, 2023

  61. [61]

    A taxonomy of prompt modifiers for text-to-image generation,

    J. Oppenlaender, “A taxonomy of prompt modifiers for text-to-image generation,” arXiv preprint arXiv:2204.13988, 2022

  62. [62]

    Movie Gen: A Cast of Media Foundation Models

    A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y. Ma, C.-Y. Chuang et al., “Movie gen: A cast of media foundation models,” arXiv preprint arXiv:2410.13720, 2024

  63. [63]

    Animate-a-story: Storytelling with retrieval-augmented video generation,

    Y. He, M. Xia, H. Chen, X. Cun, Y. Gong, J. Xing, Y. Zhang, X. Wang, C. Weng, Y. Shan et al., “Animate-a-story: Storytelling with retrieval-augmented video generation,” arXiv preprint arXiv:2307.06940, 2023

  64. [64]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Y. Guo, C. Yang, A. Rao, Z. Liang, Y. Wang, Y. Qiao, M. Agrawala, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725, 2023

  65. [65]

    Moviedreamer: Hierarchical generation for coherent long visual sequence

    C. Zhao, M. Liu, W. Wang, J. Yuan, H. Chen, B. Zhang, and C. Shen, “Moviedreamer: Hierarchical generation for coherent long visual sequence,” arXiv preprint arXiv:2407.16655, 2024

  66. [66]

    Cinetrans: Learning to generate videos with cinematic transitions via masked diffusion models,

    X. Wu, B. Gao, Y. Qiao, Y. Wang, and X. Chen, “Cinetrans: Learning to generate videos with cinematic transitions via masked diffusion models,” arXiv preprint arXiv:2508.11484, 2025

  67. [67]

    Vlogger: Make your dream a vlog,

    S. Zhuang, K. Li, X. Chen, Y. Wang, Z. Liu, Y. Qiao, and Y. Wang, “Vlogger: Make your dream a vlog,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8806–8817

  68. [68]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  69. [69]

    Scalable diffusion models with transformers,

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205

  70. [70]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022

  71. [71]

    Photorealistic text-to-image diffusion models with deep language understanding,

    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in neural information processing systems, vol. 35, pp. 36 479–36 494, 2022

  72. [72]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021

  73. [73]

    Structure and content-guided video synthesis with diffusion models,

    P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7346–7356

  74. [74]

    Pyramidal flow matching for efficient video generative modeling

    Y. Jin, Z. Sun, N. Li, K. Xu, H. Jiang, N. Zhuang, Q. Huang, Y. Song, Y. Mu, and Z. Lin, “Pyramidal flow matching for efficient video generative modeling,” arXiv preprint arXiv:2410.05954, 2024

  75. [75]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni et al., “Make-a-video: Text-to-video generation without text-video data,” arXiv preprint arXiv:2209.14792, 2022

  76. [76]

    Aucseg: Auc-oriented pixel-level long-tail semantic segmentation,

    B. Han, Q. Xu, Z. Yang, S. Bao, P. Wen, Y. Jiang, and Q. Huang, “Aucseg: Auc-oriented pixel-level long-tail semantic segmentation,” Advances in Neural Information Processing Systems, vol. 37, pp. 126 863–126 907, 2024

  77. [77]

    Freenoise: Tuning-free longer video diffusion via noise rescheduling

    H. Qiu, M. Xia, Y. Zhang, Y. He, X. Wang, Y. Shan, and Z. Liu, “Freenoise: Tuning-free longer video diffusion via noise rescheduling,” arXiv preprint arXiv:2310.15169, 2023

  78. [78]

    Gen-l-video: Multi-text to long video generation via temporal co-denoising,

    F.-Y. Wang, W. Chen, G. Song, H.-J. Ye, Y. Liu, and H. Li, “Gen-l-video: Multi-text to long video generation via temporal co-denoising,” arXiv preprint arXiv:2305.18264, 2023

  79. [79]

    Show-1: Marrying pixel and latent diffusion models for text-to-video generation,

    D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, and M. Z. Shou, “Show-1: Marrying pixel and latent diffusion models for text-to-video generation,” International Journal of Computer Vision, pp. 1–15, 2024

  80. [80]

    Capability-aware prompt reformulation learning for text-to-image generation,

    J. Zhan, Q. Ai, Y. Liu, J. Chen, and S. Ma, “Capability-aware prompt reformulation learning for text-to-image generation,” in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2145–2155

Showing first 80 references.