RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

arXiv: 2510.20206 · v2 · pith:727573YJ · submitted 2025-10-23 · 💻 cs.CV

Bingjie Gao, Qianli Ma, Xiaoxue Wu, Shuai Yang, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Qingyang Liu, Yu Qiao, Xinyuan Chen, Yaohui Wang, Li Niu

Pith reviewed 2026-05-18 05:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords prompt optimization · text-to-video generation · diffusion models · retrieval augmentation · test-time scaling · LLM fine-tuning · semantic alignment · temporal coherence

The pith

Cross-stage prompt optimization substantially improves semantic alignment, composition, and temporal stability in text-to-video generation across multiple models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a unified cross-stage prompt optimization framework can substantially improve text-to-video generation by better aligning user prompts with model training data and using iterative test-time refinements. If this holds, it would mean that existing diffusion models could produce videos with superior semantic accuracy, object composition, motion consistency, and realistic physics simply through smarter prompting. The approach combines retrieval from semantic graphs, closed-loop feedback on multiple quality signals, and LLM fine-tuning to internalize the optimizations. A general reader would care because prompt design is accessible and cheap compared to retraining large models, potentially democratizing high-quality video generation. Experiments across models and benchmarks support large gains over existing prompt methods.

Core claim

RAPO++ unifies three components: (1) retrieval-augmented prompt optimization, which enriches prompts with modifiers from a relation graph and refactors them to match training distributions; (2) closed-loop sample-specific prompt optimization, which iteratively refines prompts using multi-source feedback on semantic alignment, spatial fidelity, temporal coherence, and optical flow; and (3) fine-tuning of the rewriter LLM on optimized prompt pairs to internalize task-specific patterns for efficient generation.

What carries the argument

A three-stage prompt optimization process that performs data-aligned refinement, feedback-driven iterative scaling, and LLM internalization to improve outputs without altering the generative backbone.
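
One way to picture the staged design is as a chain of prompt-to-prompt functions. The sketch below is only an illustration under that assumption: `Stage`, `compose`, and both toy stages are hypothetical names, and the appended modifiers are invented placeholders rather than anything the paper's retrieval or feedback loop would produce. The generative backbone does not appear because the stages only touch text.

```python
from typing import Callable, List

# A stage maps a prompt string to a (hopefully better) prompt string.
Stage = Callable[[str], str]

def compose(stages: List[Stage]) -> Stage:
    """Chain prompt-rewriting stages left to right."""
    def run(prompt: str) -> str:
        for stage in stages:
            prompt = stage(prompt)
        return prompt
    return run

# Toy stand-ins (assumptions, not the paper's components).
def retrieve_and_refactor(prompt: str) -> str:
    # Stage 1 placeholder: append modifiers a relation graph might return.
    return prompt + ", golden hour lighting, slow camera pan"

def closed_loop_refine(prompt: str) -> str:
    # Stage 2 placeholder: a real loop would generate a video and score it.
    return prompt + ", consistent motion across frames"

pipeline = compose([retrieve_and_refactor, closed_loop_refine])
optimized = pipeline("a horse running on a beach")
```

Stage 3 is deliberately absent from the chain: as described, it trains the rewriter offline on (original, optimized) prompt pairs so that a single rewriter call can approximate what the chain produces.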

If this is right

Generated videos show improved handling of multiple objects and complex scene compositions.
Temporal stability increases with fewer motion artifacts and better frame-to-frame consistency.
Physical plausibility rises as depicted actions and object interactions become more realistic.
Gains appear consistently when the method is applied to different underlying text-to-video models.
After fine-tuning, the rewriter LLM produces high-quality prompts efficiently at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The staged approach could extend to text-to-image or text-to-3D tasks where similar prompt-model mismatches limit output quality.
Adding user preference signals to the feedback loop might enable more personalized video outputs over time.
Testing the method on longer video sequences could show whether coherence holds beyond short clips.

Load-bearing premise

The multi-source feedback signals in the closed-loop stage provide reliable guidance that consistently improves generation quality rather than introducing new artifacts or biases.

What would settle it

Applying the full pipeline to a held-out text-to-video model and benchmark and finding no gains or actual drops in metrics for semantic alignment and temporal coherence would falsify the central claim.

Figures

Figures reproduced from arXiv: 2510.20206 by Bingjie Gao, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Li Niu, Qianli Ma, Qingyang Liu, Shuai Yang, Xiaoxue Wu, Xinyuan Chen, Yaohui Wang, Yu Qiao.

**Figure 1.** Overview of RAPO++. The framework couples training-data–aligned prompt refinement with test-time scaling to enhance Text-to-Video (T2V) generation without altering the generative backbone. Stage 1 RAPO: Retrieval-Augmented Prompt Optimization (Sec. 3). User prompts are augmented via a retrieval-based relation graph and refactored by a finetuned LLM, while a frozen LLM provides alternative rewrites. A disc…

**Figure 2.** Generation results under different iterations of prompt refinement at inference utilizing SSPO. The initial prompt is “valkyrie riding flying horses through the clouds”. As the number of iterations increases (from left to right), the generated video becomes more detailed and vivid, and more consistent with the user’s intent.

**Figure 3.** The construction of relation graph. Relation graph consists of multiple nodes (scenes acting as core nodes with modifiers connected as sub-nodes). For each prompt in database, LLM extracts scene and related modifiers. Based on whether the extracted scene is already in the graph or not, different methods are used to incorporate the new information into the graph.

**Figure 4.** Qualitative comparisons across dynamic and static dimensions. This figure showcases videos generated using LaVie with short prompts, GPT-4 and Open-sora prompt optimizations, and our RAPO method. Videos produced with RAPO exhibit significantly sharper spatial details, smoother temporal transitions, and a closer semantic alignment with the input text.

**Figure 5.** Qualitative comparisons using LaVie with initial prompts (left) and optimized prompts from RAPO++ (right). We present qualitative comparisons from the dynamic and static dimension. The videos generated by RAPO++ exhibit sharper details, smoother temporal transitions, and better alignment with the input text.

**Figure 7.** Visualization on attention map on multiple objects from different prompts. Adding description of the relative spatial position between objects can improve multi-object generation.

**Figure 6.** Qualitative examples illustrating the limitation of RAPO++ in numeracy-related compositional tasks. Given prompts “Five colorful parrots perch on a tree branch” (left) and “Three majestic giraffes graze on the leaves of tall trees in the African savannah, their long necks reaching high, Salvador Dali style” (right), the generated frames fail to accurately match the specified object counts, highlighting per…

**Figure 8.** Prompt length distribution comparison among various methods. The distribution of RAPO-optimized prompts is closer to the training prompts.

**Figure 9.** A complex unusual example (a panda bear in a red apron and name tag works as a cashier in a Chinese New Year-themed supermarket) generated by initial prompt (left) or optimized prompt (right). The generated video from optimized prompt is more consistent with initial prompt and user intention.

**Figure 10.** Inference-time scaling performance tested on temporal consistency, visual quality, T2V alignment, and factual consistency. We conduct experiments using LaVie [27] and utilize 2.2k T2V prompts provided in [7]. Each metric exhibits a consistent upward trajectory as iteration count increases, underscoring the effectiveness of RAPO++ in enhancing generative performance.

Original abstract

Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback (including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow), yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RAPO++ chains retrieval, closed-loop feedback refinement, and LLM fine-tuning into a model-agnostic prompt pipeline for T2V, but the gains depend on feedback signals whose reliability is not yet clear.

The main point is that this paper puts forward a three-stage prompt optimization system for text-to-video generation. Stage 1 pulls in modifiers from a relation graph to align user prompts with training distributions. Stage 2 runs an iterative loop that scores candidate prompts on semantic alignment, spatial fidelity, temporal coherence, and optical flow, then feeds the best ones back to a rewriter. Stage 3 fine-tunes the rewriter LLM on the resulting prompt pairs so that later inference can skip the heavy iteration. The combination is presented as new for T2V work. Earlier papers have handled retrieval or test-time prompt tweaks separately, but this one tries to link them end-to-end while staying outside the diffusion backbone itself.

Testing the same pipeline on five different T2V models and five benchmarks is a practical strength, and releasing the code helps others verify the claims. The model-agnostic framing also makes sense for people who want better results from existing systems without retraining them.

The soft spots sit in the SSPO stage. The closed loop depends on auxiliary scorers being monotonically helpful and free of exploitable biases. The abstract does not show calibration data or ablations that would confirm the composite reward actually tracks human judgments on long videos rather than rewarding prompts that merely look good to those scorers. The stress-test concern about feedback artifacts therefore lands as a real open question until the full tables and controls are examined. The relation graph construction and its potential biases also need more detail.

This paper is for researchers and engineers who work on prompt engineering for current diffusion T2V tools and want a ready-to-apply pipeline rather than new generative mechanisms. A reader focused on practical improvements in semantic alignment and temporal stability would find the staged approach worth reading. It deserves peer review so the experimental protocols and quantitative results can be checked directly.
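
The Stage 2 loop described in the letter can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes candidates are selected by average rank across the feedback signals, and it hides "generate a video, then score it" behind each scorer. The names `average_rank_select`, `refine`, and `propose`, and all scores in `table`, are hypothetical.

```python
from typing import Callable, Dict, List

# Assumption: each scorer hides "generate a video, then score it" behind a
# prompt -> float call, where higher is better.
Scorer = Callable[[str], float]

def average_rank_select(candidates: List[str], scorers: Dict[str, Scorer]) -> str:
    """Return the candidate with the lowest mean rank across all scorers."""
    total_rank = {c: 0 for c in candidates}
    for score in scorers.values():
        # Rank 1 goes to the highest-scoring candidate under this signal.
        ordered = sorted(candidates, key=score, reverse=True)
        for rank, cand in enumerate(ordered, start=1):
            total_rank[cand] += rank
    # Summed rank orders candidates the same way mean rank does.
    return min(candidates, key=lambda c: total_rank[c])

def refine(prompt: str,
           propose: Callable[[str], List[str]],
           scorers: Dict[str, Scorer],
           iterations: int = 3) -> str:
    """Closed loop: propose rewrites, keep the best-ranked one, repeat."""
    for _ in range(iterations):
        # Keep the incumbent in the pool and drop duplicate candidates.
        candidates = list(dict.fromkeys([prompt] + propose(prompt)))
        prompt = average_rank_select(candidates, scorers)
    return prompt

# Hypothetical scores for three candidate prompts under three signals.
table = {
    "semantic": {"a": 0.9, "b": 0.5, "c": 0.7},
    "optical_flow": {"a": 0.2, "b": 0.8, "c": 0.7},
    "temporal": {"a": 0.3, "b": 0.4, "c": 0.9},
}
scorers = {name: scores.__getitem__ for name, scores in table.items()}
best = average_rank_select(["a", "b", "c"], scorers)
```

Ranking rather than summing raw scores sidesteps the problem that the signals live on different scales, but it does nothing about the concern raised above: if a scorer is biased, its ranks are biased too.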

Referee Report

2 major / 2 minor

Summary. The manuscript presents RAPO++, a three-stage cross-stage prompt optimization framework for text-to-video (T2V) generation. Stage 1 (RAPO) retrieves semantically relevant modifiers from a relation graph to enrich and refactor user prompts for better alignment with training data distributions. Stage 2 (SSPO) performs closed-loop iterative refinement of prompts using multi-source feedback signals (semantic alignment, spatial fidelity, temporal coherence, and optical flow). Stage 3 fine-tunes the rewriter LLM on pairs of original and SSPO-optimized prompts. The central empirical claim is that RAPO++ yields significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming prior methods by large margins across five T2V models and five benchmarks, while remaining model-agnostic and without modifying the generative backbone. Code is released at https://github.com/Vchitect/RAPO.

Significance. If the empirical results hold after addressing validation concerns, this would constitute a meaningful contribution to prompt engineering for generative video models by providing a scalable, training-free (at inference) optimization pipeline that combines retrieval, test-time iteration, and distillation. The public code release supports reproducibility. The approach is noteworthy for its model-agnostic nature, but its significance depends on demonstrating that the SSPO feedback loop produces genuine quality improvements rather than metric-specific artifacts.

major comments (2)

[§3.2] (SSPO closed-loop description): The composite reward formed from semantic alignment, spatial fidelity, temporal coherence, and optical flow is presented as providing reliable guidance for iterative prompt refinement, yet no calibration study, human correlation analysis, or bias audit is reported. Because the scoring models are themselves imperfect on long-horizon video properties, the loop risks converging on prompts that exploit scorer artifacts rather than improving human-perceived quality. This directly underpins the large-margin gains attributed to Stage 2 and must be addressed with concrete evidence (e.g., human preference studies or failure-case analysis).
[§5] (Experiments): The headline claim of large-margin outperformance across five T2V models and five benchmarks is stated without accompanying quantitative tables, per-stage ablations, or controls that isolate SSPO's contribution from RAPO and LLM fine-tuning. Detailed metrics (e.g., specific CLIP, temporal consistency, or human eval scores) and statistical significance tests are required to substantiate the central empirical assertion.
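
The human correlation analysis requested in the first major comment could start from something as simple as a rank correlation between the composite score and human ratings of the same videos. The sketch below computes Spearman's coefficient in plain Python; the function names are illustrative and both data lists are invented for the example, not taken from the paper.

```python
from statistics import mean
from typing import List, Sequence

def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = shared
        i = j + 1
    return out

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)  # undefined (division by zero) if either input is constant

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: composite scorer output vs. mean human rating per video.
composite = [0.61, 0.74, 0.58, 0.90, 0.66]
human = [3, 4, 2, 3, 4]
rho = spearman(composite, human)
```

A low or unstable coefficient on held-out videos would be one concrete sign that the loop is optimizing scorer artifacts; a high one would not prove the opposite, but it would narrow the concern.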

minor comments (2)

[§3.1] The construction and data sources of the 'relation graph' in Stage 1 are only briefly mentioned; a short appendix or paragraph detailing its creation would improve reproducibility.
[§5] Figure captions and axis labels in the experimental section could be expanded to explicitly state which metric corresponds to which feedback signal used in SSPO.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the validation of the SSPO stage and the overall empirical claims.

Referee: [§3.2] (SSPO closed-loop description): The composite reward formed from semantic alignment, spatial fidelity, temporal coherence, and optical flow is presented as providing reliable guidance for iterative prompt refinement, yet no calibration study, human correlation analysis, or bias audit is reported. Because the scoring models are themselves imperfect on long-horizon video properties, the loop risks converging on prompts that exploit scorer artifacts rather than improving human-perceived quality. This directly underpins the large-margin gains attributed to Stage 2 and must be addressed with concrete evidence (e.g., human preference studies or failure-case analysis).

Authors: We agree that explicit validation of the composite reward against human judgments is necessary to rule out potential exploitation of scorer artifacts. The individual feedback signals draw from established video evaluation practices, but the current manuscript does not report a dedicated calibration or human correlation study. In the revised version we will add a human preference study with multiple annotators comparing videos from original, RAPO, and SSPO prompts, together with a failure-case analysis that examines cases where the loop improves or fails to improve perceived quality. These additions will directly address the concern. revision: yes

Referee: [§5] (Experiments): The headline claim of large-margin outperformance across five T2V models and five benchmarks is stated without accompanying quantitative tables, per-stage ablations, or controls that isolate SSPO's contribution from RAPO and LLM fine-tuning. Detailed metrics (e.g., specific CLIP, temporal consistency, or human eval scores) and statistical significance tests are required to substantiate the central empirical assertion.

Authors: Section 5 already contains quantitative tables comparing RAPO++ against baselines across the five models and benchmarks. To better isolate SSPO's contribution we will add explicit per-stage ablation tables and controls in the revision. We will also expand the reported metrics with specific CLIP, temporal consistency, and human evaluation scores, and include statistical significance tests (e.g., paired t-tests) to support the claimed margins. revision: yes
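
For readers unfamiliar with the paired t-test mentioned in the response, here is a minimal sketch of the statistic, assuming each prompt is scored once with and once without optimization. The function name `paired_t` and both score lists are made up for illustration and are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple

def paired_t(before: Sequence[float], after: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Work on per-prompt differences so prompt difficulty cancels out.
    diffs = [a - b for a, b in zip(after, before)]
    # Standard error of the mean difference (sample standard deviation).
    se = stdev(diffs) / sqrt(len(diffs))
    # Note: identical differences give se == 0 and a ZeroDivisionError.
    return mean(diffs) / se, len(diffs) - 1

# Hypothetical per-prompt metric scores (same prompts, two conditions).
baseline = [0.60, 0.62, 0.58, 0.65, 0.61]
optimized = [0.66, 0.65, 0.63, 0.70, 0.64]
t_stat, dof = paired_t(baseline, optimized)
```

With 4 degrees of freedom the two-sided 5% critical value is about 2.776, so a statistic larger than that in absolute value would count as significant at that level. Pairing by prompt matters here because prompt difficulty varies far more than the effect of any rewriting method.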

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents an empirical prompt-optimization pipeline (RAPO retrieval, SSPO closed-loop feedback from auxiliary scorers, and LLM fine-tuning) whose performance claims rest on external benchmarks across five T2V models rather than any mathematical derivation. No equations, fitted parameters, or self-citations are shown to reduce the reported gains to inputs by construction. The SSPO loop employs independent multi-source signals whose correlation with quality is an empirical assumption, not a definitional tautology. The method is therefore self-contained against external evaluation and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework depends on the existence and utility of a relation graph for prompt enrichment and on the assumption that iterative feedback from generated videos can be used to reliably improve prompts; specific numerical hyperparameters for retrieval and iteration counts are not detailed in the abstract.

free parameters (1)

retrieval and iteration hyperparameters
Parameters controlling how many modifiers are retrieved and how many refinement iterations are performed are likely tuned on validation data.

axioms (1)

domain assumption: User-provided prompts are typically short, unstructured, and misaligned with the model's training distribution
This premise is stated directly in the abstract as the motivation for Stage 1.

invented entities (1)

relation graph (no independent evidence)
purpose: To retrieve semantically relevant modifiers for prompt enrichment
Introduced as a core component of the RAPO stage; no independent evidence of its construction or coverage is provided in the abstract.
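
The Figure 3 caption describes scenes as core nodes with modifiers attached as sub-nodes, and new prompts being merged differently depending on whether their scene already exists. A rough sketch of that data structure follows. It is an assumption-heavy toy: the class and method names are hypothetical, scenes are matched by exact string rather than by any semantic similarity, and modifiers are ranked by raw frequency, none of which the page confirms for the actual system.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List

class RelationGraph:
    """Scenes are core nodes; each holds a multiset of modifier sub-nodes."""

    def __init__(self) -> None:
        self.modifiers: Dict[str, Counter] = defaultdict(Counter)

    def add(self, scene: str, modifiers: Iterable[str]) -> None:
        # Known scene: merge into its existing sub-nodes.
        # New scene: defaultdict creates the core node on first access.
        self.modifiers[scene].update(modifiers)

    def retrieve(self, scene: str, k: int = 2) -> List[str]:
        # Exact-match lookup (assumption); unseen scenes yield no modifiers.
        if scene not in self.modifiers:
            return []
        return [m for m, _ in self.modifiers[scene].most_common(k)]

graph = RelationGraph()
# In the described pipeline an LLM would extract these from database prompts.
graph.add("beach", ["waves", "sunset"])
graph.add("beach", ["sunset", "seagulls"])
graph.add("forest", ["fog"])
top = graph.retrieve("beach", k=1)
```

Even this toy makes the coverage question visible: a scene that never appears in the source prompts retrieves nothing, so the quality of Stage 1 depends directly on what the graph was built from.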

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback, including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear

Paper passage: Average Ranking for Prompt Selection... each candidate refined prompt is evaluated using multiple criteria such as semantic alignment, spatial fidelity, temporal consistency, and physical plausibility

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · 21 internal anchors

[1] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025
[2] W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang et al., “Hunyuanvideo: A systematic framework for large video generative models,” arXiv preprint arXiv:2412.03603, 2024
[3] Q. Zhang, F. Lyu, Z. Sun, L. Wang, W. Zhang, W. Hua, H. Wu, Z. Guo, Y. Wang, N. Muennighoff et al., “A survey on test-time scaling in large language models: What, how, where, and how well?” arXiv preprint arXiv:2503.24235, 2025
[4] B. Gao, Q. Zhou, and Y. Deng, “Hie-edt: Hierarchical interval estimation-based evidential decision tree,” Pattern Recognition, vol. 146, p. 110040, 2024
[5] N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto, “s1: Simple test-time scaling,” arXiv preprint arXiv:2501.19393, 2025
[6] B. Gao, X. Gao, X. Wu, Y. Zhou, Y. Qiao, L. Niu, X. Chen, and Y. Wang, “The devil is in the prompts: Retrieval-augmented prompt optimization for text-to-video generation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3173–3183
[7] Y. Wang, Z. Tan, J. Wang, X. Yang, C. Jin, and H. Li, “Lift: Leveraging human feedback for text-to-video model alignment,” arXiv preprint arXiv:2412.04814, 2024
[8] Z. Huang, F. Zhang, X. Xu, Y. He, J. Yu, Z. Dong, Q. Ma, N. Chanpaisit, C. Si, Y. Jiang et al., “Vbench++: Comprehensive and versatile benchmark suite for video generative models,” arXiv preprint arXiv:2411.13503, 2024
[9] B. Gao, Q. Zhou, and Y. Deng, “Bim-afa: Belief information measure-based attribute fusion approach in improving the quality of uncertain data,” Information Sciences, vol. 608, pp. 950–969, 2022
[10] X. He, D. Jiang, G. Zhang, M. Ku, A. Soni, S. Siu, H. Chen, A. Chandra, Z. Jiang, A. Arulraj et al., “Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation,” arXiv preprint arXiv:2406.15252, 2024
[11] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024
[12] Y. Hao, Z. Chi, L. Dong, and F. Wei, “Optimizing prompts for text-to-image generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 66923–66939, 2023
[13] M. Uehara, Y. Zhao, C. Wang, X. Li, A. Regev, S. Levine, and T. Biancalani, “Inference-time alignment in diffusion models with reward-guided generation: Tutorial and review,” arXiv preprint arXiv:2501.09685, 2025
[14] G. Wang, Y. Schiff, S. S. Sahoo, and V. Kuleshov, “Remasking discrete diffusion models with inference-time scaling,” arXiv preprint arXiv:2503.00307, 2025
[15] X. Zhang, H. Lin, H. Ye, J. Zou, J. Ma, Y. Liang, and Y. Du, “Inference-time scaling of diffusion models through classical search,” arXiv preprint arXiv:2505.23614, 2025
[16] N. Ma, S. Tong, H. Jia, H. Hu, Y.-C. Su, M. Zhang, X. Yang, Y. Li, T. Jaakkola, X. Jia et al., “Inference-time scaling for diffusion models beyond scaling denoising steps,” arXiv preprint arXiv:2501.09732, 2025
[17] Q. Ma, X. Ning, D. Liu, L. Niu, and L. Zhang, “Decouple-then-merge: Finetune diffusion models as multi-task learning,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 23281–23291
[18] H. Bansal, Z. Lin, T. Xie, Z. Zong, M. Yarom, Y. Bitton, C. Jiang, Y. Sun, K.-W. Chang, and A. Grover, “Videophy: Evaluating physical commonsense for video generation,” arXiv preprint arXiv:2406.03520, 2024
[19] B. Han, Q. Xu, S. Bao, Z. Yang, K. Zi, and Q. Huang, “Lightfair: Towards an efficient alternative for fair t2i diffusion via debiasing pre-trained text encoders,” arXiv preprint arXiv:2509.23639, 2025
[20] E. Xie, J. Chen, Y. Zhao, J. Yu, L. Zhu, Y. Lin, Z. Zhang, M. Li, J. Chen, H. Cai et al., “Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer,” arXiv preprint arXiv:2501.18427, 2025
[21] Y. Oshima, M. Suzuki, Y. Matsuo, and H. Furuta, “Inference-time text-to-video alignment with diffusion latent beam search,” arXiv preprint arXiv:2501.19252, 2025
[22] Y. Hao, Z. Chi, L. Dong, and F. Wei, “Optimizing prompts for text-to-image generation,” Advances in Neural Information Processing Systems, vol. 36, 2024
[23] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023
[24] Z. Chen, L. Zhang, F. Weng, L. Pan, and Z. Lan, “Tailored visions: Enhancing text-to-image generation with personalized prompt rewriting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7727–7736
[25] W. Mo, T. Zhang, Y. Bai, B. Su, J.-R. Wen, and Q. Yang, “Dynamic prompt optimizing for text-to-image generation,” in CVPR, 2024
[26] Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,” arXiv preprint arXiv:2408.06072, 2024
[27] Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang et al., “Lavie: High-quality video generation with cascaded latent diffusion models,” arXiv preprint arXiv:2309.15103, 2023
[28] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023
[29] M. Wu, L. Wang, P. Zhao, F. Yang, J. Zhang, J. Liu, Y. Zhan, W. Han, H. Sun, J. Ji et al., “Reprompt: Reasoning-augmented reprompting for text-to-image generation via reinforcement learning,” arXiv preprint arXiv:2505.17540, 2025
[30] Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit et al., “Vbench: Comprehensive benchmark suite for video generative models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21807–21818
[31]

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

F. Meng, J. Liao, X. Tan, W. Shao, Q. Lu, K. Zhang, Y. Cheng, D. Li, Y. Qiao, and P . Luo, “Towards world simulator: Crafting physical commonsense-based benchmark for video generation,” arXiv preprint arXiv:2410.05363, 2024

[32]

Latte: Latent Diffusion Transformer for Video Generation

X. Ma, Y. Wang, G. Jia, X. Chen, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao, “Latte: Latent diffusion transformer for video generation,” arXiv preprint arXiv:2401.03048, 2024

[33]

Denoising diffusion probabilistic models

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

[34]

Denoising Diffusion Implicit Models

J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020

[35]

Score-Based Generative Modeling through Stochastic Differential Equations

Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020

[36]

Improving image generation with better captions

J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo et al., “Improving image generation with better captions,” Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, vol. 2, no. 3, p. 8, 2023

[37]

High-resolution image synthesis with latent diffusion models

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695

[38]

Promptenhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting

L. Wang, X. Xing, Y. Cheng, Z. Zhao, J. Tao, Q. Wang, R. Li, X. Li, M. Wu, X. Deng et al., “Promptenhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting,” arXiv preprint arXiv:2509.04545, 2025

[39]

Scaling rectified flow transformers for high-resolution image synthesis

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first International Conference on Machine Learning, 2024

[40]

Adding conditional control to text-to-image diffusion models

L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847

[41]

Collaborative diffusion for multi-modal face generation and editing

Z. Huang, K. C. Chan, Y. Jiang, and Z. Liu, “Collaborative diffusion for multi-modal face generation and editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6080–6090

[42]

T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback

J. Li, W. Feng, T.-J. Fu, X. Wang, S. Basu, W. Chen, and W. Y. Wang, “T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback,” arXiv preprint arXiv:2405.18750, 2024

[43]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan, “Videocrafter2: Overcoming data limitations for high-quality video diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7310–7320

[44]

A user-friendly framework for generating model-preferred prompts in text-to-image synthesis

N. Hei, Q. Guo, Z. Wang, Y. Wang, H. Wang, and W. Zhang, “A user-friendly framework for generating model-preferred prompts in text-to-image synthesis,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2139–2147

[45]

Prompt refinement with image pivot for text-to-image generation

J. Zhan, Q. Ai, Y. Liu, Y. Pan, T. Yao, J. Mao, S. Ma, and T. Mei, “Prompt refinement with image pivot for text-to-image generation,” arXiv preprint arXiv:2407.00247, 2024

[46]

Open-sora: Democratizing efficient video production for all

“Open-sora: Democratizing efficient video production for all,”

[47]

URL: https://github.com/hpcaitech/Open-Sora

[48]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

[49]

Vista: A test-time self-improving video generation agent

D. X. Long, X. Wan, H. Nakhost, C.-Y. Lee, T. Pfister, and S. Ö. Arık, “Vista: A test-time self-improving video generation agent,” arXiv preprint arXiv:2510.15831, 2025

[50]

Training-free structured diffusion guidance for compositional text-to-image synthesis

W. Feng, X. He, T.-J. Fu, V. Jampani, A. Akula, P. Narayana, S. Basu, X. E. Wang, and W. Y. Wang, “Training-free structured diffusion guidance for compositional text-to-image synthesis,” arXiv preprint arXiv:2212.05032, 2022

[51]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952, 2023

[52]

Magnet: We never know how text-to-image diffusion models work, until we learn how vision-language models function

C. Zhuang, Y. Hu, and P. Gao, “Magnet: We never know how text-to-image diffusion models work, until we learn how vision-language models function,” arXiv preprint arXiv:2409.19967, 2024

[53]

Conform: Contrast is all you need for high-fidelity text-to-image diffusion models

T. H. S. Meral, E. Simsar, F. Tombari, and P. Yanardag, “Conform: Contrast is all you need for high-fidelity text-to-image diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9005–9014

[54]

Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models

H. Chefer, Y. Alaluf, Y. Vinker, L. Wolf, and D. Cohen-Or, “Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models,” ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–10, 2023

[55]

Grounded text-to-image synthesis with attention refocusing

Q. Phung, S. Ge, and J.-B. Huang, “Grounded text-to-image synthesis with attention refocusing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7932–7942

[56]

A cat is a cat (not a dog!): Unraveling information mix-ups in text-to-image encoders through causal analysis and embedding optimization

C.-Y. Chen, L.-W. Tsao, C. Tseng, and H.-H. Shuai, “A cat is a cat (not a dog!): Unraveling information mix-ups in text-to-image encoders through causal analysis and embedding optimization,” arXiv preprint arXiv:2410.00321, 2024

[57]

Evalcrafter: Benchmarking and evaluating large video generation models

Y. Liu, X. Cun, X. Liu, X. Wang, Y. Zhang, H. Chen, Y. Liu, T. Zeng, R. Chan, and Y. Shan, “Evalcrafter: Benchmarking and evaluating large video generation models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 139–22 149

[58]

T2v-compbench: A comprehensive benchmark for compositional text-to-video generation

K. Sun, K. Huang, X. Liu, Y. Wu, Z. Xu, Z. Li, and X. Liu, “T2v-compbench: A comprehensive benchmark for compositional text-to-video generation,” arXiv preprint arXiv:2407.14505, 2024

[59]

Design guidelines for prompt engineering text-to-image generative models

V. Liu and L. B. Chilton, “Design guidelines for prompt engineering text-to-image generative models,” in Proceedings of the 2022 CHI conference on human factors in computing systems, 2022, pp. 1–23

[60]

What’s in a text-to-image prompt? the potential of stable diffusion in visual arts education

N. Dehouche and K. Dehouche, “What’s in a text-to-image prompt? the potential of stable diffusion in visual arts education,” Heliyon, vol. 9, no. 6, 2023

[61]

A taxonomy of prompt modifiers for text-to-image generation

J. Oppenlaender, “A taxonomy of prompt modifiers for text-to-image generation,” arXiv preprint arXiv:2204.13988, 2022

[62]

Movie Gen: A Cast of Media Foundation Models

A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y. Ma, C.-Y. Chuang et al., “Movie gen: A cast of media foundation models,” arXiv preprint arXiv:2410.13720, 2024

[63]

Animate-a-story: Storytelling with retrieval-augmented video generation

Y. He, M. Xia, H. Chen, X. Cun, Y. Gong, J. Xing, Y. Zhang, X. Wang, C. Weng, Y. Shan et al., “Animate-a-story: Storytelling with retrieval-augmented video generation,” arXiv preprint arXiv:2307.06940, 2023

[64]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Y. Guo, C. Yang, A. Rao, Z. Liang, Y. Wang, Y. Qiao, M. Agrawala, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725, 2023

[65]

Moviedreamer: Hierarchical generation for coherent long visual sequence

C. Zhao, M. Liu, W. Wang, J. Yuan, H. Chen, B. Zhang, and C. Shen, “Moviedreamer: Hierarchical generation for coherent long visual sequence,” arXiv preprint arXiv:2407.16655, 2024

[66]

Cinetrans: Learning to generate videos with cinematic transitions via masked diffusion models

X. Wu, B. Gao, Y. Qiao, Y. Wang, and X. Chen, “Cinetrans: Learning to generate videos with cinematic transitions via masked diffusion models,” arXiv preprint arXiv:2508.11484, 2025

[67]

Vlogger: Make your dream a vlog

S. Zhuang, K. Li, X. Chen, Y. Wang, Z. Liu, Y. Qiao, and Y. Wang, “Vlogger: Make your dream a vlog,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8806–8817

[68]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

[69]

Scalable diffusion models with transformers

W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205

[70]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022

[71]

Photorealistic text-to-image diffusion models with deep language understanding

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in neural information processing systems, vol. 35, pp. 36 479–36 494, 2022

[72]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021

[73]

Structure and content-guided video synthesis with diffusion models

P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7346–7356

[74]

Pyramidal flow matching for efficient video generative modeling

Y. Jin, Z. Sun, N. Li, K. Xu, H. Jiang, N. Zhuang, Q. Huang, Y. Song, Y. Mu, and Z. Lin, “Pyramidal flow matching for efficient video generative modeling,” arXiv preprint arXiv:2410.05954, 2024

[75]

Make-A-Video: Text-to-Video Generation without Text-Video Data

U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni et al., “Make-a-video: Text-to-video generation without text-video data,” arXiv preprint arXiv:2209.14792, 2022

[76]

Aucseg: Auc-oriented pixel-level long-tail semantic segmentation

B. Han, Q. Xu, Z. Yang, S. Bao, P. Wen, Y. Jiang, and Q. Huang, “Aucseg: Auc-oriented pixel-level long-tail semantic segmentation,” Advances in Neural Information Processing Systems, vol. 37, pp. 126 863–126 907, 2024

[77]

Freenoise: Tuning-free longer video diffusion via noise rescheduling

H. Qiu, M. Xia, Y. Zhang, Y. He, X. Wang, Y. Shan, and Z. Liu, “Freenoise: Tuning-free longer video diffusion via noise rescheduling,” arXiv preprint arXiv:2310.15169, 2023

[78]

Gen-l-video: Multi-text to long video generation via temporal co-denoising

F.-Y. Wang, W. Chen, G. Song, H.-J. Ye, Y. Liu, and H. Li, “Gen-l-video: Multi-text to long video generation via temporal co-denoising,” arXiv preprint arXiv:2305.18264, 2023

[79]

Show-1: Marrying pixel and latent diffusion models for text-to-video generation

D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, and M. Z. Shou, “Show-1: Marrying pixel and latent diffusion models for text-to-video generation,” International Journal of Computer Vision, pp. 1–15, 2024

[80]

Capability-aware prompt reformulation learning for text-to-image generation

J. Zhan, Q. Ai, Y. Liu, J. Chen, and S. Ma, “Capability-aware prompt reformulation learning for text-to-image generation,” in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2145–2155


[25] [25]

Dynamic prompt optimizing for text-to-image generation,

W. Mo, T. Zhang, Y. Bai, B. Su, J.-R. Wen, and Q. Yang, “Dynamic prompt optimizing for text-to-image generation,” in CVPR, 2024

work page 2024

[26] [26]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,” arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Qiao, and Ziwei Liu

Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P . Yang et al., “Lavie: High-quality video gen- eration with cascaded latent diffusion models,” arXiv preprint arXiv:2309.15103, 2023

work page arXiv 2023

[28] [28]

Mistral 7B

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[29] [29]

Reprompt: Reasoning-augmented reprompting for text-to-image generation via reinforcement learning,

M. Wu, L. Wang, P . Zhao, F. Yang, J. Zhang, J. Liu, Y. Zhan, W. Han, H. Sun, J. Jiet al., “Reprompt: Reasoning-augmented reprompting for text-to-image generation via reinforcement learning,”arXiv preprint arXiv:2505.17540, 2025

work page arXiv 2025

[30] [30]

Vbench: Comprehensive benchmark suite for video generative models,

Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit et al., “Vbench: Comprehensive benchmark suite for video generative models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21 807–21 818

work page 2024

[31] [31]

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

F. Meng, J. Liao, X. Tan, W. Shao, Q. Lu, K. Zhang, Y. Cheng, D. Li, Y. Qiao, and P . Luo, “Towards world simulator: Crafting physical commonsense-based benchmark for video generation,” arXiv preprint arXiv:2410.05363, 2024

work page internal anchor Pith review arXiv 2024

[32] [32]

Latte: Latent Diffusion Transformer for Video Generation

X. Ma, Y. Wang, G. Jia, X. Chen, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao, “Latte: Latent diffusion transformer for video generation,” arXiv preprint arXiv:2401.03048, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [33]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P . Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

work page 2020

[34] [34]

Denoising Diffusion Implicit Models

J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[35] [35]

Score-Based Generative Modeling through Stochastic Differential Equations

Y. Song, J. Sohl-Dickstein, D. P . Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2011

[36] [36]

Improving image generation with better captions,

J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo et al., “Improving image generation with better captions,” Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, vol. 2, no. 3, p. 8, 2023

work page 2023

[37] [37]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P . Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695

work page 2022

[38] [38]

Promptenhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting,

L. Wang, X. Xing, Y. Cheng, Z. Zhao, J. Tao, Q. Wang, R. Li, X. Li, M. Wu, X. Denget al., “Promptenhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting,”arXiv preprint arXiv:2509.04545, 2025

work page arXiv 2025

[39] [39]

Scaling rectified flow transformers for high-resolution image synthesis,

P . Esser, S. Kulal, A. Blattmann, R. Entezari, J. M ¨uller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first International Conference on Machine Learning, 2024

work page 2024

[40] [40]

Adding conditional con- trol to text-to-image diffusion models,

L. Zhang, A. Rao, and M. Agrawala, “Adding conditional con- trol to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847

work page 2023

[41] [41]

Collaborative diffu- sion for multi-modal face generation and editing,

Z. Huang, K. C. Chan, Y. Jiang, and Z. Liu, “Collaborative diffu- sion for multi-modal face generation and editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6080–6090

work page 2023

[42] [42]

T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback

J. Li, W. Feng, T.-J. Fu, X. Wang, S. Basu, W. Chen, and W. Y. Wang, “T2v-turbo: Breaking the quality bottleneck of video con- sistency model with mixed reward feedback,” arXiv preprint arXiv:2405.18750, 2024

work page arXiv 2024

[43] [43]

Videocrafter2: Overcoming data limitations for high- quality video diffusion models,

H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan, “Videocrafter2: Overcoming data limitations for high- quality video diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7310–7320

work page 2024

[44] [44]

A user-friendly framework for generating model-preferred prompts in text-to-image synthesis,

N. Hei, Q. Guo, Z. Wang, Y. Wang, H. Wang, and W. Zhang, “A user-friendly framework for generating model-preferred prompts in text-to-image synthesis,” in Proceedings of the AAAI Confer- ence on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2139–2147

work page 2024

[45] [45]

Prompt refinement with image pivot for text-to-image genera- tion,

J. Zhan, Q. Ai, Y. Liu, Y. Pan, T. Yao, J. Mao, S. Ma, and T. Mei, “Prompt refinement with image pivot for text-to-image genera- tion,” arXiv preprint arXiv:2407.00247, 2024

work page arXiv 2024

[46] [46]

Open-sora: Democratizing efficient video production for all,

“Open-sora: Democratizing efficient video production for all,”

work page

[47] [47]

URL: https: //github.com/hpcaitech/Open-Sora

work page

[48] [48]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[49] [49]

Vista: A test-time self-improving video generation agent,

D. X. Long, X. Wan, H. Nakhost, C.-Y. Lee, T. Pfister, and S. ¨O. Arık, “Vista: A test-time self-improving video generation agent,” arXiv preprint arXiv:2510.15831, 2025

work page arXiv 2025

[50] [50]

Training-free structured diffusion guidance for compositional text-to-image synthesis.arXiv preprint arXiv:2212.05032, 2022

W. Feng, X. He, T.-J. Fu, V . Jampani, A. Akula, P . Narayana, S. Basu, X. E. Wang, and W. Y. Wang, “Training-free structured diffusion guidance for compositional text-to-image synthesis,” arXiv preprint arXiv:2212.05032, 2022

work page arXiv 2022

[51] [51]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M ¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffu- sion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952, 2023. 16

work page internal anchor Pith review Pith/arXiv arXiv 2023

[52] [52]

Magnet: We never know how text-to-image diffusion models work, until we learn how vision- language models function,

C. Zhuang, Y. Hu, and P . Gao, “Magnet: We never know how text-to-image diffusion models work, until we learn how vision- language models function,” arXiv preprint arXiv:2409.19967, 2024

work page arXiv 2024

[53] [53]

Conform: Contrast is all you need for high-fidelity text-to-image diffusion models,

T. H. S. Meral, E. Simsar, F. Tombari, and P . Yanardag, “Conform: Contrast is all you need for high-fidelity text-to-image diffusion models,” in Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2024, pp. 9005–9014

work page 2024

[54] [54]

Attend- and-excite: Attention-based semantic guidance for text-to-image diffusion models,

H. Chefer, Y. Alaluf, Y. Vinker, L. Wolf, and D. Cohen-Or, “Attend- and-excite: Attention-based semantic guidance for text-to-image diffusion models,” ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–10, 2023

work page 2023

[55] Q. Phung, S. Ge, and J.-B. Huang, “Grounded text-to-image synthesis with attention refocusing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7932–7942

[56] C.-Y. Chen, L.-W. Tsao, C. Tseng, and H.-H. Shuai, “A cat is a cat (not a dog!): Unraveling information mix-ups in text-to-image encoders through causal analysis and embedding optimization,” arXiv preprint arXiv:2410.00321, 2024

[57] Y. Liu, X. Cun, X. Liu, X. Wang, Y. Zhang, H. Chen, Y. Liu, T. Zeng, R. Chan, and Y. Shan, “Evalcrafter: Benchmarking and evaluating large video generation models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22139–22149

[58] K. Sun, K. Huang, X. Liu, Y. Wu, Z. Xu, Z. Li, and X. Liu, “T2v-compbench: A comprehensive benchmark for compositional text-to-video generation,” arXiv preprint arXiv:2407.14505, 2024

[59] V. Liu and L. B. Chilton, “Design guidelines for prompt engineering text-to-image generative models,” in Proceedings of the 2022 CHI conference on human factors in computing systems, 2022, pp. 1–23

[60] N. Dehouche and K. Dehouche, “What’s in a text-to-image prompt? the potential of stable diffusion in visual arts education,” Heliyon, vol. 9, no. 6, 2023

[61] J. Oppenlaender, “A taxonomy of prompt modifiers for text-to-image generation,” arXiv preprint arXiv:2204.13988, 2022

[62] A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y. Ma, C.-Y. Chuang et al., “Movie Gen: A cast of media foundation models,” arXiv preprint arXiv:2410.13720, 2024

[63] Y. He, M. Xia, H. Chen, X. Cun, Y. Gong, J. Xing, Y. Zhang, X. Wang, C. Weng, Y. Shan et al., “Animate-a-story: Storytelling with retrieval-augmented video generation,” arXiv preprint arXiv:2307.06940, 2023

[64] Y. Guo, C. Yang, A. Rao, Z. Liang, Y. Wang, Y. Qiao, M. Agrawala, D. Lin, and B. Dai, “AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725, 2023

[65] C. Zhao, M. Liu, W. Wang, J. Yuan, H. Chen, B. Zhang, and C. Shen, “Moviedreamer: Hierarchical generation for coherent long visual sequence,” arXiv preprint arXiv:2407.16655, 2024

[66] X. Wu, B. Gao, Y. Qiao, Y. Wang, and X. Chen, “Cinetrans: Learning to generate videos with cinematic transitions via masked diffusion models,” arXiv preprint arXiv:2508.11484, 2025

[67] S. Zhuang, K. Li, X. Chen, Y. Wang, Z. Liu, Y. Qiao, and Y. Wang, “Vlogger: Make your dream a vlog,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8806–8817

[68] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

[69] W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205

[70] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022

[71] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in neural information processing systems, vol. 35, pp. 36479–36494, 2022

[72] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021

[73] P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7346–7356

[74] Y. Jin, Z. Sun, N. Li, K. Xu, H. Jiang, N. Zhuang, Q. Huang, Y. Song, Y. Mu, and Z. Lin, “Pyramidal flow matching for efficient video generative modeling,” arXiv preprint arXiv:2410.05954, 2024

[75] U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni et al., “Make-A-Video: Text-to-video generation without text-video data,” arXiv preprint arXiv:2209.14792, 2022

[76] B. Han, Q. Xu, Z. Yang, S. Bao, P. Wen, Y. Jiang, and Q. Huang, “Aucseg: Auc-oriented pixel-level long-tail semantic segmentation,” Advances in Neural Information Processing Systems, vol. 37, pp. 126863–126907, 2024

[77] H. Qiu, M. Xia, Y. Zhang, Y. He, X. Wang, Y. Shan, and Z. Liu, “Freenoise: Tuning-free longer video diffusion via noise rescheduling,” arXiv preprint arXiv:2310.15169, 2023

[78] F.-Y. Wang, W. Chen, G. Song, H.-J. Ye, Y. Liu, and H. Li, “Gen-l-video: Multi-text to long video generation via temporal co-denoising,” arXiv preprint arXiv:2305.18264, 2023

[79] D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, and M. Z. Shou, “Show-1: Marrying pixel and latent diffusion models for text-to-video generation,” International Journal of Computer Vision, pp. 1–15, 2024

[80] J. Zhan, Q. Ai, Y. Liu, J. Chen, and S. Ma, “Capability-aware prompt reformulation learning for text-to-image generation,” in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2145–2155