RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
Pith reviewed 2026-05-18 05:09 UTC · model grok-4.3
pith:727573YJ
The pith
Cross-stage prompt optimization substantially improves semantic alignment, composition, and temporal stability in text-to-video generation across multiple models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAPO++ unifies three stages: (1) retrieval-augmented prompt optimization, which enriches prompts with modifiers from a relation graph and refactors them to match training distributions; (2) closed-loop sample-specific prompt optimization, which iteratively refines prompts using multi-source feedback on semantic alignment, spatial fidelity, temporal coherence, and optical flow; and (3) fine-tuning of the rewriter LLM on optimized prompt pairs, which internalizes task-specific patterns for efficient generation.
What carries the argument
A three-stage prompt optimization process that performs data-aligned refinement, feedback-driven iterative scaling, and LLM internalization, improving outputs without altering the generative backbone.
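To make the closed-loop stage concrete, here is a minimal sketch of a feedback-driven refinement loop. It is not the authors' implementation: the function names (`sspo_loop`, `composite_score`, `toy_generate`, `toy_rewrite`), the equal weighting of signals, the greedy keep-the-best rule, and the string stand-in for a video are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# All names here are hypothetical; a "video" is a plain string stand-in.
Scorer = Callable[[str, str], float]  # (user_prompt, video) -> score in [0, 1]


def composite_score(prompt: str, video: str, scorers: Dict[str, Scorer]) -> float:
    """Unweighted mean of all feedback signals (equal weighting is an assumption)."""
    return sum(score(prompt, video) for score in scorers.values()) / len(scorers)


def sspo_loop(
    user_prompt: str,
    generate: Callable[[str], str],
    rewrite: Callable[[str, Dict[str, float]], str],
    scorers: Dict[str, Scorer],
    max_iters: int = 3,
) -> Tuple[str, float, List[Tuple[str, float]]]:
    """Refine a prompt iteratively, keeping the best-scoring candidate seen so far."""
    best_prompt = user_prompt
    best_video = generate(best_prompt)
    best_score = composite_score(user_prompt, best_video, scorers)
    history = [(best_prompt, best_score)]
    for _ in range(max_iters):
        # Per-signal feedback on the current best video drives the next rewrite.
        feedback = {name: score(user_prompt, best_video) for name, score in scorers.items()}
        candidate = rewrite(best_prompt, feedback)
        video = generate(candidate)
        # Always score against the original user prompt so intent is preserved.
        score_value = composite_score(user_prompt, video, scorers)
        history.append((candidate, score_value))
        if score_value > best_score:
            best_prompt, best_video, best_score = candidate, video, score_value
    return best_prompt, best_score, history


# Toy stand-ins so the sketch runs end to end.
def toy_generate(prompt: str) -> str:
    return f"<video for: {prompt}>"


TOY_SCORERS: Dict[str, Scorer] = {
    "semantic": lambda prompt, video: 1.0 if "dog" in video else 0.0,
    "temporal": lambda prompt, video: 1.0 if "steady camera" in video else 0.2,
}


def toy_rewrite(prompt: str, feedback: Dict[str, float]) -> str:
    hints = {"semantic": "a dog", "temporal": "steady camera"}
    weakest = min(feedback, key=feedback.get)  # target the lowest-scoring signal
    return f"{prompt}, {hints[weakest]}"


best_prompt, best_score, history = sspo_loop(
    "a dog running on a beach", toy_generate, toy_rewrite, TOY_SCORERS, max_iters=2
)
print(best_prompt, best_score)
```

In a real setting the generator would be a T2V model and the scorers would be learned evaluators, which is exactly where the load-bearing premise about scorer reliability enters.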
If this is right
- Generated videos show improved handling of multiple objects and complex scene compositions.
- Temporal stability increases with fewer motion artifacts and better frame-to-frame consistency.
- Physical plausibility rises as depicted actions and object interactions become more realistic.
- Gains appear consistently when the method is applied to different underlying text-to-video models.
- After fine-tuning, the rewriter LLM produces high-quality prompts efficiently at inference time.
Where Pith is reading between the lines
- The staged approach could extend to text-to-image or text-to-3D tasks where similar prompt-model mismatches limit output quality.
- Adding user preference signals to the feedback loop might enable more personalized video outputs over time.
- Testing the method on longer video sequences could show whether coherence holds beyond short clips.
Load-bearing premise
The multi-source feedback signals in the closed-loop stage provide reliable guidance that consistently improves generation quality rather than introducing new artifacts or biases.
What would settle it
Applying the full pipeline to a held-out text-to-video model and benchmark and finding no gains or actual drops in metrics for semantic alignment and temporal coherence would falsify the central claim.
Original abstract
Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback (including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow), yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.
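As a rough illustration of the Stage 1 idea, the sketch below retrieves modifiers from a toy relation graph and appends them to a prompt. The abstract does not specify how the graph is built or queried, so everything here is an assumption: the graph layout (`TOY_GRAPH`, word mapped to weighted modifiers), exact word matching instead of semantic retrieval, and the function names `retrieve_modifiers` and `enrich_prompt`.

```python
from typing import Dict, List

# Hypothetical relation graph: scene word -> {modifier: relevance weight}.
TOY_GRAPH: Dict[str, Dict[str, float]] = {
    "dog": {"wagging its tail": 0.9, "on a leash": 0.6, "golden fur": 0.8},
    "beach": {"at sunset": 0.7, "waves rolling in": 0.85},
}


def retrieve_modifiers(
    prompt: str, graph: Dict[str, Dict[str, float]], top_k: int = 3
) -> List[str]:
    """Collect modifiers linked to words in the prompt, highest weight first."""
    lowered = prompt.lower()
    words = set(lowered.split())
    scored = []
    for node, neighbours in graph.items():
        if node in words:  # exact word match; a real system would likely use embeddings
            scored.extend(neighbours.items())
    # Sort by descending weight, then alphabetically for a deterministic order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    selected: List[str] = []
    for modifier, _ in scored:
        if modifier not in selected and modifier not in lowered:
            selected.append(modifier)
    return selected[:top_k]


def enrich_prompt(prompt: str, graph: Dict[str, Dict[str, float]], top_k: int = 3) -> str:
    """Append the retrieved modifiers to the prompt as a comma-separated suffix."""
    modifiers = retrieve_modifiers(prompt, graph, top_k)
    return ", ".join([prompt] + modifiers)


enriched = enrich_prompt("a dog on the beach", TOY_GRAPH)
print(enriched)
```

The refactoring step that reshapes the enriched prompt to match the training distribution is not sketched here, since the abstract gives no detail on it.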
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents RAPO++, a three-stage cross-stage prompt optimization framework for text-to-video (T2V) generation. Stage 1 (RAPO) retrieves semantically relevant modifiers from a relation graph to enrich and refactor user prompts for better alignment with training data distributions. Stage 2 (SSPO) performs closed-loop iterative refinement of prompts using multi-source feedback signals (semantic alignment, spatial fidelity, temporal coherence, and optical flow). Stage 3 fine-tunes the rewriter LLM on pairs of original and SSPO-optimized prompts. The central empirical claim is that RAPO++ yields significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming prior methods by large margins across five T2V models and five benchmarks, while remaining model-agnostic and without modifying the generative backbone. Code is released at https://github.com/Vchitect/RAPO.
Significance. If the empirical results hold after addressing validation concerns, this would constitute a meaningful contribution to prompt engineering for generative video models by providing a scalable, training-free (at inference) optimization pipeline that combines retrieval, test-time iteration, and distillation. The public code release supports reproducibility. The approach is noteworthy for its model-agnostic nature, but its significance depends on demonstrating that the SSPO feedback loop produces genuine quality improvements rather than metric-specific artifacts.
major comments (2)
- [§3.2] (SSPO closed-loop description): The composite reward formed from semantic alignment, spatial fidelity, temporal coherence, and optical flow is presented as providing reliable guidance for iterative prompt refinement, yet no calibration study, human correlation analysis, or bias audit is reported. Because the scoring models are themselves imperfect on long-horizon video properties, the loop risks converging on prompts that exploit scorer artifacts rather than improving human-perceived quality. This directly underpins the large-margin gains attributed to Stage 2 and must be addressed with concrete evidence (e.g., human preference studies or failure-case analysis).
- [§5] (Experiments): The headline claim of large-margin outperformance across five T2V models and five benchmarks is stated without accompanying quantitative tables, per-stage ablations, or controls that isolate SSPO's contribution from RAPO and LLM fine-tuning. Detailed metrics (e.g., specific CLIP, temporal consistency, or human eval scores) and statistical significance tests are required to substantiate the central empirical assertion.
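One simple form the requested human correlation analysis could take is a rank correlation between the automated composite score and human ratings of the same videos. The sketch below computes Spearman's rho with the standard library only; the function names and the idea of using per-video mean human ratings are assumptions, not something the manuscript reports.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def spearman(scorer_values: Sequence[float], human_ratings: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(scorer_values), ranks(human_ratings))
```

A rho near 1 would suggest the scorer orders videos the way annotators do; a low or negative value on SSPO-optimized outputs specifically would be a warning sign of scorer exploitation.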
minor comments (2)
- [§3.1] The construction and data sources of the 'relation graph' in Stage 1 are only briefly mentioned; a short appendix or paragraph detailing its creation would improve reproducibility.
- [§5] Figure captions and axis labels in the experimental section could be expanded to explicitly state which metric corresponds to which feedback signal used in SSPO.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the validation of the SSPO stage and the overall empirical claims.
- Referee: [§3.2] (SSPO closed-loop description): The composite reward formed from semantic alignment, spatial fidelity, temporal coherence, and optical flow is presented as providing reliable guidance for iterative prompt refinement, yet no calibration study, human correlation analysis, or bias audit is reported. Because the scoring models are themselves imperfect on long-horizon video properties, the loop risks converging on prompts that exploit scorer artifacts rather than improving human-perceived quality. This directly underpins the large-margin gains attributed to Stage 2 and must be addressed with concrete evidence (e.g., human preference studies or failure-case analysis).
  Authors: We agree that explicit validation of the composite reward against human judgments is necessary to rule out potential exploitation of scorer artifacts. The individual feedback signals draw from established video evaluation practices, but the current manuscript does not report a dedicated calibration or human correlation study. In the revised version we will add a human preference study with multiple annotators comparing videos from original, RAPO, and SSPO prompts, together with a failure-case analysis that examines cases where the loop improves or fails to improve perceived quality. These additions will directly address the concern.
  Revision: yes
- Referee: [§5] (Experiments): The headline claim of large-margin outperformance across five T2V models and five benchmarks is stated without accompanying quantitative tables, per-stage ablations, or controls that isolate SSPO's contribution from RAPO and LLM fine-tuning. Detailed metrics (e.g., specific CLIP, temporal consistency, or human eval scores) and statistical significance tests are required to substantiate the central empirical assertion.
  Authors: Section 5 already contains quantitative tables comparing RAPO++ against baselines across the five models and benchmarks. To better isolate SSPO's contribution we will add explicit per-stage ablation tables and controls in the revision. We will also expand the reported metrics with specific CLIP, temporal consistency, and human evaluation scores, and include statistical significance tests (e.g., paired t-tests) to support the claimed margins.
  Revision: yes
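The paired t-test mentioned in the response compares per-prompt scores with and without the method on the same prompts. A minimal standard-library sketch of the test statistic follows; the function name `paired_t_statistic` and the sample numbers in the usage note are illustrative assumptions, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(
    baseline: Sequence[float], treated: Sequence[float]
) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for a paired t-test on matched scores."""
    if len(baseline) != len(treated):
        raise ValueError("paired samples must have the same length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_value = mean_diff / (sd_diff / math.sqrt(n))
    return t_value, n - 1
```

For example, `paired_t_statistic([0.50, 0.60, 0.55, 0.70], [0.55, 0.70, 0.60, 0.80])` gives a t value of about 5.2 with 3 degrees of freedom. Converting t to a p-value requires the Student t distribution, for instance via `scipy.stats.ttest_rel`, which performs the whole test in one call.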
Circularity Check
No significant circularity in derivation chain
The paper presents an empirical prompt-optimization pipeline (RAPO retrieval, SSPO closed-loop feedback from auxiliary scorers, and LLM fine-tuning) whose performance claims rest on external benchmarks across five T2V models rather than any mathematical derivation. No equations, fitted parameters, or self-citations are shown to reduce the reported gains to inputs by construction. The SSPO loop employs independent multi-source signals whose correlation with quality is an empirical assumption, not a definitional tautology. The method is therefore self-contained against external evaluation and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- retrieval and iteration hyperparameters
axioms (1)
- domain assumption: User-provided prompts are typically short, unstructured, and misaligned with the model's training distribution
invented entities (1)
- relation graph: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback, including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow
- IndisputableMonolith/Foundation/DimensionForcing.lean: alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Average Ranking for Prompt Selection... each candidate refined prompt is evaluated using multiple criteria such as semantic alignment, spatial fidelity, temporal consistency, and physical plausibility
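The "Average Ranking for Prompt Selection" passage quoted above describes scoring each candidate prompt on several criteria. One plausible reading, sketched here under that assumption, is to rank candidates per criterion and pick the one with the lowest mean rank. The function name, the tie handling, and the sample scores are hypothetical; the paper's exact procedure may differ.

```python
from typing import Dict, Tuple


def select_by_average_rank(
    candidates: Dict[str, Dict[str, float]]
) -> Tuple[str, Dict[str, float]]:
    """Pick the candidate with the lowest mean rank across criteria.

    `candidates` maps a prompt id to {criterion: score}, higher scores being better.
    Ties within a criterion are broken by insertion order (an arbitrary choice).
    """
    names = list(candidates)
    criteria = list(next(iter(candidates.values())))
    mean_rank = {name: 0.0 for name in names}
    for criterion in criteria:
        # Rank 1 goes to the highest score on this criterion.
        ordered = sorted(names, key=lambda name: -candidates[name][criterion])
        for position, name in enumerate(ordered, start=1):
            mean_rank[name] += position / len(criteria)
    best = min(names, key=lambda name: mean_rank[name])
    return best, mean_rank


scores = {
    "p1": {"semantic": 0.9, "spatial": 0.5, "temporal": 0.6},
    "p2": {"semantic": 0.8, "spatial": 0.7, "temporal": 0.7},
    "p3": {"semantic": 0.6, "spatial": 0.6, "temporal": 0.5},
}
best, mean_rank = select_by_average_rank(scores)
print(best, mean_rank)
```

Ranking before averaging sidesteps the problem that the criteria live on different scales, at the cost of discarding how large the score gaps are.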
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.