{"total":12,"items":[{"citing_arxiv_id":"2606.22131","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Feed-forward Motion In-betweening for Any 4D","primary_cat":"cs.CV","submitted_at":"2026-06-20T16:18:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes a feed-forward keyframe-conditioned in-betweening method for arbitrary 4D meshes using a topology-agnostic VAE and MMDiT-based rectified flow model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31204","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Probabilistic Precipitation Nowcasting with Rectified Flow Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-29T12:11:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FREUD applies rectified flow transformers with frame-wise encoding and a unified decoder to achieve state-of-the-art probabilistic precipitation nowcasting on the SEVIR benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.21996","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VRAG: Learning World Models for Interactive Video Generation","primary_cat":"cs.CV","submitted_at":"2025-05-28T05:55:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.15689","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DOLLAR: Few-Step Video Generation via Distillation and Latent Reward 
Optimization","primary_cat":"cs.CV","submitted_at":"2024-12-20T09:07:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DOLLAR combines variational score and consistency distillation for few-step video generation plus latent reward optimization, reporting 82.57 VBench score and up to 278x speedup over the teacher diffusion model for 128-frame 10-second videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.04145","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models","primary_cat":"cs.CV","submitted_at":"2023-11-07T17:16:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-image pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.08089","ref_index":291,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory","primary_cat":"cs.CV","submitted_at":"2023-08-16T01:43:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video 
generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.06571","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ModelScope Text-to-Video Technical Report","primary_cat":"cs.CV","submitted_at":"2023-08-12T13:53:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[12] Tao Feng, Hangjie Yuan, Mang Wang, Ziyuan Huang, Ang Bian, and Jianzhou Zhang. Progressive learning without forgetting. arXiv preprint arXiv:2211.15215, 2022. [13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139-144, 2020. [14] William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495, 2022. [15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. 
Advances in"},{"citing_arxiv_id":"2211.11018","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MagicVideo: Efficient Video Generation With Latent Diffusion Models","primary_cat":"cs.CV","submitted_at":"2022-11-20T16:40:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MagicVideo generates 256x256 text-conditioned video clips via latent diffusion with a custom 3D U-Net, achieving roughly 64 times lower compute than prior video diffusion models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.01324","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers","primary_cat":"cs.CV","submitted_at":"2022-11-02T17:43:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"that use deterministic [15, 34, 43, 46, 70, 89] or stochastic [4, 5, 16, 90] iterative updates. Several works retrieve auxiliary images related to the text prompt from an external database and condition generation on them to boost performance [7, 9, 66]. Recently, several text-to-video diffusion models were proposed and achieved high-quality video generation results [20, 25, 29, 67, 85]. Applications of text-to-image diffusion models Apart from serving as a backbone to be fine-tuned for general image-to-image translation tasks [80], text-to-image diffusion models have also demonstrated impressive capabilities in other downstream applications. 
Diffusion models can be directly applied to various inverse problems, such as super-"},{"citing_arxiv_id":"2210.02399","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Phenaki: Variable Length Video Generation From Open Domain Textual Description","primary_cat":"cs.CV","submitted_at":"2022-10-05T17:18:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images and videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2210.02303","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Imagen Video: High Definition Video Generation with Diffusion Models","primary_cat":"cs.CV","submitted_at":"2022-10-05T14:41:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2209.03003","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow","primary_cat":"cs.LG","submitted_at":"2022-09-07T08:59:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation 
step.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"05173, 2022. [19] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. [20] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020. [21] William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495, 2022. [22] Ulrich G Haussmann and Etienne Pardoux. Time reversal of diffusions. The Annals of Probability, pages 1188-1205, 1986. [23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models."}],"limit":50,"offset":0}