Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models
Pith reviewed 2026-05-20 05:53 UTC · model grok-4.3
The pith
Rebalancing attention from generated frames to the reference frame during early denoising increases motion in image-to-video models without retraining or loss of fidelity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We identify reference-frame dominance as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS, a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength.
What carries the argument
DyMoS (Dynamic Motion Slider), the training-free scalar adjustment that reduces attention weight from non-reference frames to reference-frame keys during the first denoising steps.
If this is right
- Motion dynamics improve consistently across multiple state-of-the-art image-to-video backbones.
- Visual quality and fidelity to the reference image are preserved.
- No additional training or modification of model weights is required.
- A single scalar parameter enables continuous, user-controllable adjustment of motion strength.
- Intervention limited to initial denoising steps avoids introducing temporal inconsistencies in later steps.
Where Pith is reading between the lines
- The same attention-rebalancing idea could be tested on text-to-video models to modulate object or camera motion.
- Attention maps in the early denoising phase may serve as a general diagnostic for other temporal artifacts in diffusion video models.
- Integrating the scalar into user interfaces would let practitioners tune motion per generation without pipeline changes.
- The finding suggests that targeted early-step attention edits might generalize to controlling other conditioning signals beyond the reference image.
Load-bearing premise
Excessive self-attention from non-reference frames to reference-frame key tokens is the primary driver of motion suppression, and rebalancing it only in the initial denoising steps is sufficient to restore dynamics without later artifacts or inconsistencies.
What would settle it
Running the same set of reference images through baseline and DyMoS-augmented models and finding no measurable increase in average inter-frame optical flow magnitude while reference-image similarity scores remain unchanged.
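The settling test above can be sketched with a cheap stand-in for optical flow. The snippet below is a minimal sketch, not the paper's evaluation code. It assumes videos are NumPy arrays of shape (T, H, W, C) and uses mean absolute inter-frame difference as the motion proxy and first-frame mean squared error as the fidelity proxy. The function names `interframe_difference` and `first_frame_mse` are hypothetical.

```python
import numpy as np


def interframe_difference(video):
    """Mean absolute difference between consecutive frames.

    video: array of shape (T, H, W, C). This is an assumed, simplified
    motion proxy; it is not an optical-flow estimate.
    """
    video = np.asarray(video, dtype=np.float64)
    if video.shape[0] < 2:
        raise ValueError("need at least two frames")
    return float(np.abs(np.diff(video, axis=0)).mean())


def first_frame_mse(video, reference):
    """Mean squared error between the first frame and the reference image.

    Assumed fidelity proxy: lower means the first frame is closer to the
    reference image.
    """
    video = np.asarray(video, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    return float(((video[0] - reference) ** 2).mean())


# Toy check: a static clip has zero motion; a clip whose pixel values
# increase by 1 per frame has a motion proxy of exactly 1.
static_clip = np.zeros((4, 2, 2, 3))
ramp_clip = np.stack([np.full((2, 2, 3), float(t)) for t in range(3)])
print(interframe_difference(static_clip), interframe_difference(ramp_clip))
```

Under these assumptions, the claim would be in trouble if the motion proxy did not rise for the augmented model over the same reference images while the fidelity proxy stayed flat.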
Original abstract
Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify reference-frame dominance as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS (Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies reference-frame dominance in image-to-video diffusion models, where non-reference frames over-allocate self-attention to reference-frame key tokens, thereby suppressing inter-frame motion. It introduces DyMoS, a training-free and model-agnostic intervention that rebalances this attention pathway only during initial denoising steps via a single scalar motion-strength parameter, claiming consistent improvements in motion dynamics across multiple state-of-the-art I2V backbones while preserving visual quality and reference-image fidelity.
Significance. If the proposed mechanism and intervention prove robust, DyMoS would supply a simple, zero-training-cost control knob for motion strength that leaves both the reference image and model weights untouched. This directly targets a widespread practical limitation of current I2V systems and could be adopted as a post-hoc module by practitioners. The training-free, model-agnostic design and explicit scalar parameter constitute clear strengths if supported by reproducible quantitative evidence.
major comments (3)
- Abstract: the central claim that excessive self-attention from non-reference frames to reference-frame keys is the primary causal mechanism for motion suppression rests on correlational observation rather than interventional evidence; no controlled perturbation of attention weights independent of the DyMoS scalar is reported to isolate this pathway from other conditioning or diffusion dynamics.
- Abstract / Experiments: the manuscript states that DyMoS 'consistently improves motion dynamics' across backbones yet supplies no quantitative metrics, error bars, ablation tables, or dataset descriptions, leaving the magnitude and reliability of the reported gains unsupported.
- Method: the decision to restrict rebalancing to initial denoising steps is motivated by the same attention observation, but the paper provides neither timestep-resolved attention maps nor ablations confirming that reference-frame dominance does not re-emerge or that later steps remain unaffected, undermining the sufficiency argument.
minor comments (2)
- Clarify the precise mathematical formulation of how the motion-strength scalar modifies the attention scores (e.g., which keys/values are scaled and by what factor) to improve reproducibility.
- Include attention-map visualizations at multiple timesteps and across generated frames to directly illustrate the claimed reference-frame dominance before and after DyMoS.
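The attention-map evidence requested above could be summarized by a single number per layer and timestep. The following is a minimal sketch under stated assumptions: attention weights are available as a post-softmax NumPy array of shape (..., tokens, tokens), and a boolean mask marks which tokens belong to the reference frame. The function name `reference_attention_mass` is hypothetical and is not taken from the paper.

```python
import numpy as np


def reference_attention_mass(attn, ref_mask):
    """Average attention that non-reference queries place on reference keys.

    attn: post-softmax weights of shape (..., tokens, tokens), with queries
          on the second-to-last axis and keys on the last axis.
    ref_mask: boolean array of shape (tokens,), True for reference-frame tokens.
    """
    attn = np.asarray(attn, dtype=np.float64)
    ref_mask = np.asarray(ref_mask, dtype=bool)
    # Keep only rows whose query token is outside the reference frame.
    non_ref_rows = attn[..., ~ref_mask, :]
    # Sum the probability those queries assign to reference-frame keys.
    mass = non_ref_rows[..., ref_mask].sum(axis=-1)
    return float(mass.mean())


# Toy check: with uniform attention over 4 tokens and 1 reference token,
# each non-reference query puts 1/4 of its attention on the reference frame.
uniform = np.full((2, 4, 4), 0.25)
mask = np.array([True, False, False, False])
print(reference_attention_mass(uniform, mask))
```

One way to read the result, offered as an assumption rather than the paper's criterion: under uniform attention the value equals the share of tokens that belong to the reference frame, so values well above that share would be consistent with the dominance the paper describes.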
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and committing to revisions that strengthen the evidentiary basis for our claims about reference-frame dominance and DyMoS.
- Referee: Abstract: the central claim that excessive self-attention from non-reference frames to reference-frame keys is the primary causal mechanism for motion suppression rests on correlational observation rather than interventional evidence; no controlled perturbation of attention weights independent of the DyMoS scalar is reported to isolate this pathway from other conditioning or diffusion dynamics.
  Authors: We agree the initial identification relies on observational attention analysis. DyMoS functions as a direct intervention by selectively scaling the reference-frame key contributions in self-attention. The method's consistent motion improvements across backbones provide supporting evidence for the mechanism. To isolate the pathway more rigorously, we will add controlled ablation experiments that apply targeted attention perturbations without using the full DyMoS scalar in the revised manuscript.
  Revision: partial
- Referee: Abstract / Experiments: the manuscript states that DyMoS 'consistently improves motion dynamics' across backbones yet supplies no quantitative metrics, error bars, ablation tables, or dataset descriptions, leaving the magnitude and reliability of the reported gains unsupported.
  Authors: We will expand the experiments section to include quantitative motion metrics (e.g., optical flow magnitude and inter-frame difference scores), error bars from repeated runs, comprehensive ablation tables varying the motion-strength parameter and application window, and full dataset descriptions with evaluation protocols. These additions will be incorporated in the revision to substantiate the reported gains.
  Revision: yes
- Referee: Method: the decision to restrict rebalancing to initial denoising steps is motivated by the same attention observation, but the paper provides neither timestep-resolved attention maps nor ablations confirming that reference-frame dominance does not re-emerge or that later steps remain unaffected, undermining the sufficiency argument.
  Authors: We will add timestep-resolved attention visualizations across the full denoising trajectory to demonstrate the temporal dynamics of reference-frame dominance. We will also include ablations comparing DyMoS applied only in early steps versus all steps or late steps, confirming that dominance does not re-emerge later and that restricting the intervention preserves quality without side effects.
  Revision: yes
Circularity Check
No significant circularity: empirical observation leads to interventional method with explicit parameter
The paper identifies reference-frame dominance via direct observation of attention allocation in existing I2V models, then introduces DyMoS as a training-free rebalancing intervention restricted to initial denoising steps. It adds a single scalar parameter for continuous motion control and validates the approach through experiments on multiple backbones while preserving image fidelity. No step reduces a claimed result to fitted inputs by construction, no self-definitional loop exists between the observed mechanism and the proposed fix, and the abstract and description contain no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work. The derivation remains self-contained as an observational finding followed by an explicit, testable modification rather than a closed mathematical reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- motion strength scalar
axioms (1)
- domain assumption: Self-attention in I2V diffusion models propagates reference information across generated frames via key-token attention.
invented entities (1)
- reference-frame dominance: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: we add a scalar bias to the attention logits from non-reference-frame query tokens to reference-frame key tokens before the softmax operation: $\tilde{L}[i, j] = L[i, j] - \gamma \cdot \mathbb{1}[j \in I_{f_0}] \cdot \mathbb{1}[i \notin I_{f_0}]$
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat induction and 8-tick period forcing (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: DyMoS incorporates two key design choices. First, we apply the modulation only during the first $\lambda \in [0, 1]$ fraction of sampling steps
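The two quoted passages, the logit bias and the early-step window, can be illustrated on a toy attention matrix. This is a minimal sketch under stated assumptions, not the authors' implementation: `ref_mask` stands in for the reference-frame token set, `gamma` for γ, `lam` for λ, and the helper names `bias_reference_logits` and `gamma_at_step` are hypothetical. How the window boundary is rounded is also an assumption.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bias_reference_logits(logits, ref_mask, gamma):
    """Subtract gamma from logits[i, j] when key j is a reference-frame token
    and query i is not; all other logits are left unchanged."""
    ref_mask = np.asarray(ref_mask, dtype=bool)
    penalty = (~ref_mask)[:, None] & ref_mask[None, :]
    return logits - gamma * penalty


def gamma_at_step(step, num_steps, gamma, lam):
    """Return gamma during the first lam fraction of sampling steps, else 0.
    The strict '<' boundary is an assumption of this sketch."""
    return gamma if step < lam * num_steps else 0.0


# Toy example: 6 tokens, the first 2 belong to the reference frame.
rng = np.random.default_rng(0)
ref_mask = np.array([True, True, False, False, False, False])
logits = rng.normal(size=(6, 6))

base = softmax(logits)
biased = softmax(bias_reference_logits(logits, ref_mask, gamma=1.0))

# Attention mass that each non-reference query places on reference keys.
ref_mass_base = base[~ref_mask][:, ref_mask].sum(axis=1)
ref_mass_biased = biased[~ref_mask][:, ref_mask].sum(axis=1)
print(ref_mass_base, ref_mass_biased)
```

For a non-reference query with baseline reference mass p, subtracting γ before the softmax gives a new mass of p·e^(−γ) / (p·e^(−γ) + 1 − p), so a positive γ always lowers it and γ = 0 recovers the baseline. Rows belonging to reference-frame queries are untouched.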
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Full algorithm of DyMoS
Algorithm 1 DyMoS
Input: Reference image I_ref, text prompt c, total i...
1. Motion: Which video has the most dynamic and realistic motion? Examples include water ripples, cloth movement, human action, and camera motion.
2. Fidelity: Which video best preserves the appearance of the reference image throughout the sequence? Examples include the subject, background, and colors.
3. Text alignment: Which video most faithfully reflects the content described in the text prompt?
4. Overall preference: Overall, which video do you prefer?
We collect 30 responses for each question over 25 r...