Recognition: 2 theorem links
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
Pith reviewed 2026-05-13 06:55 UTC · model grok-4.3
The pith
SyncDPO uses direct preference optimization with on-the-fly temporal distortions to improve video-audio synchronization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SyncDPO is a post-training framework that leverages Direct Preference Optimization to enhance temporal sensitivity in video-audio joint generation models. It replaces costly preference pair construction with on-the-fly rule-based strategies that distort temporal structures, and employs curriculum learning to gradually increase the subtlety of misalignments. This approach yields models with improved temporal alignment and better out-of-distribution generalization compared to supervised fine-tuning baselines.
What carries the argument
On-the-fly rule-based negative construction strategies for creating temporally misaligned video-audio pairs to serve as negatives in Direct Preference Optimization, supported by a progressive curriculum.
If this is right
- The resulting models exhibit stronger temporal alignment on in-distribution benchmarks.
- Generalization improves on out-of-distribution test sets by better capturing motion-sound relationships.
- Training remains computationally efficient by avoiding separate sampling and ranking steps.
- The method applies across varied domains including ambient sounds and speech videos.
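The rule-based distortions named in the abstract (temporal shifts, stretches, and drops) can be sketched as simple operations on a frame-aligned audio track. The function names, magnitudes, and frame-level representation below are illustrative assumptions, not the paper's actual implementation:

```python
import random

def temporal_shift(audio_frames, offset):
    """Shift audio relative to video by `offset` frames (positive = audio lags)."""
    if offset >= 0:
        return [audio_frames[0]] * offset + audio_frames[:len(audio_frames) - offset]
    return audio_frames[-offset:] + [audio_frames[-1]] * (-offset)

def temporal_stretch(audio_frames, factor):
    """Resample by `factor` via nearest-neighbour lookup, keeping length fixed."""
    n = len(audio_frames)
    return [audio_frames[min(int(i * factor), n - 1)] for i in range(n)]

def temporal_drop(audio_frames, start, length):
    """Silence a segment, breaking the pairing between audio event and visual trigger."""
    out = list(audio_frames)
    for i in range(start, min(start + length, len(out))):
        out[i] = 0.0
    return out

def make_negative(audio_frames, rng=random):
    """Pick one distortion at random to build a DPO negative on the fly."""
    op = rng.choice(["shift", "stretch", "drop"])
    if op == "shift":
        return temporal_shift(audio_frames, rng.choice([-4, -2, 2, 4]))
    if op == "stretch":
        return temporal_stretch(audio_frames, rng.choice([0.8, 1.25]))
    return temporal_drop(audio_frames, rng.randrange(len(audio_frames)), 4)
```

Because each operator only rearranges or zeroes existing frames, the negative costs no extra annotation, sampling, or ranking, which is the efficiency claim the abstract makes.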
Where Pith is reading between the lines
- Similar on-the-fly negative construction could accelerate preference tuning in other sequence alignment problems.
- The curriculum design may offer a template for stabilizing preference optimization when negative quality varies.
- If effective, this reduces reliance on human or model-based ranking for creating training signals in multimodal settings.
Load-bearing premise
The synthetic temporal distortions generated by the rules are representative enough of real misalignments to train the model effectively through preference comparisons.
What would settle it
Observing no statistically significant difference in temporal synchronization metrics between SyncDPO and baseline models when evaluated on the four benchmarks would indicate the approach does not deliver the claimed improvements.
Original abstract
Recent advancements in video-audio joint generation have achieved remarkable success in semantic correspondence. However, achieving precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains a challenging problem. The post-training method for joint generation is largely dominated by Supervised Fine-Tuning, but the commonly used Mean Squared Error loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization offers an alternative by introducing explicit misaligned counterparts to better improve temporal sensitivity. In this paper we propose a post-training framework SyncDPO, leveraging DPO to improve the temporal sensitivity of V-A joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in substantial computational cost. To improve efficiency, we introduce a suite of on-the-fly rule-based negative construction strategies that distort temporal structures without incurring additional annotation or sampling. We demonstrate that the temporal alignment capability can be effectively reinforced by providing explicit negative supervision through temporally distorted V-A pairs. Accordingly, we implement a curriculum learning strategy that progressively increases the difficulty of negative samples, transitioning from coarse misalignment to subtle inconsistencies. Extensive objective and subjective experiments across four diverse benchmarks, ranging from ambient sound videos to human speech videos, demonstrate that SyncDPO significantly outperforms other methods in improving model's temporal alignment capability. It also demonstrates superior generalization on out-of-distribution benchmark by capturing intrinsic motion-sound dynamics. Demo and code is available in https://syncdpo.github.io/syncdpo/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SyncDPO, a post-training framework that applies Direct Preference Optimization (DPO) to video-audio joint generation models to improve temporal synchronization. It replaces costly sampling-and-ranking for preference pairs with on-the-fly rule-based negative construction strategies that apply temporal distortions such as shifts, stretches, and drops. A curriculum learning schedule progressively increases negative difficulty from coarse to subtle misalignments. The authors claim this yields superior temporal alignment on four benchmarks spanning ambient sound and human speech videos, plus better out-of-distribution generalization, with code and demos released.
Significance. If the results hold, the work offers an efficient route to post-train multimodal generators for fine-grained timing without extra annotation or heavy inference-time sampling. The explicit release of code and demo is a clear strength for reproducibility. The approach targets a genuine gap where semantic correspondence is already strong but precise event-level synchronization remains weak.
Major comments (1)
- [Methods (negative construction strategies)] The central assumption that the on-the-fly rule-based negative construction strategies (temporal shifts, stretches, drops) produce misalignment distributions equivalent to those arising during actual generative sampling is load-bearing for the entire claim. The manuscript provides no analysis or ablation demonstrating that these synthetic distortions capture the correlated visual-audio artifacts typical of diffusion or autoregressive sampling errors, rather than merely varying difficulty inside an artificial family.
Minor comments (2)
- [Abstract] The abstract asserts outperformance across benchmarks but omits concrete metric values, baseline names, and any mention of statistical tests or variance, which reduces immediate assessability of the empirical claims.
- Notation for the preference pairs and the curriculum schedule should be introduced with explicit equations or pseudocode to improve clarity for readers unfamiliar with the exact distortion operators.
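As the second comment suggests, the pair construction and curriculum schedule are easiest to pin down as short pseudocode. The linear schedule and the circular-shift distortion below are hypothetical stand-ins for whatever operators and annealing the paper actually uses:

```python
def curriculum_offset(step, total_steps, max_offset=8, min_offset=1):
    """Linearly anneal distortion magnitude from coarse to subtle.

    Early in training the negative is shifted by up to `max_offset` frames
    (easy to distinguish); late in training only `min_offset` frames,
    forcing fine-grained temporal sensitivity. Values are illustrative.
    """
    progress = min(step / float(total_steps), 1.0)
    offset = max_offset - progress * (max_offset - min_offset)
    return max(int(round(offset)), min_offset)

def make_preference_pair(video, audio, step, total_steps):
    """Build a (preferred, rejected) pair for DPO without sampling or ranking.

    The ground-truth (video, audio) pair is the positive; a temporally
    shifted copy of the audio is the on-the-fly negative.
    """
    k = curriculum_offset(step, total_steps)
    negative_audio = audio[k:] + audio[:k]  # circular shift as a stand-in distortion
    return (video, audio), (video, negative_audio)
```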
Simulated Author's Rebuttal
We thank the referee for this valuable comment on our negative construction approach. We provide a detailed response below and will update the manuscript accordingly.
Point-by-point responses
Referee: [Methods (negative construction strategies)] The central assumption that the on-the-fly rule-based negative construction strategies (temporal shifts, stretches, drops) produce misalignment distributions equivalent to those arising during actual generative sampling is load-bearing for the entire claim. The manuscript provides no analysis or ablation demonstrating that these synthetic distortions capture the correlated visual-audio artifacts typical of diffusion or autoregressive sampling errors, rather than merely varying difficulty inside an artificial family.
Authors: We agree that a more rigorous validation of the negative construction strategies would bolster the paper. Our strategies are motivated by common temporal misalignment patterns seen in generated video-audio pairs, including those from diffusion models, such as audio lagging behind visual events or mismatched durations. Although we did not include a direct distributional comparison in the original submission, the superior performance on real benchmarks and OOD generalization suggest that the approach effectively targets relevant misalignment types. In the revised version, we will add an analysis section with examples of base model failures and how our negatives relate to them, along with an ablation on the curriculum stages. This will demonstrate that the distortions are not merely artificial but capture key aspects of the problem.
Revision: yes
Circularity Check
No significant circularity; builds on established DPO with novel rule-based negatives
Full rationale
The paper's derivation applies the standard DPO loss to video-audio pairs using explicitly defined on-the-fly rule-based distortions (temporal shifts, stretches, drops) for negative samples and a curriculum schedule. These construction rules are stated as independent of model outputs and do not reduce any claimed prediction to a fitted parameter or self-definition. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the method; the central claim of improved temporal sensitivity rests on the new negative-construction procedure rather than on prior author results. Experimental results are reported as external validation, not as part of the derivation chain itself.
Axiom & Free-Parameter Ledger
Axioms (1)
- [domain assumption] Direct Preference Optimization improves sensitivity to temporal misalignments when provided with explicit negative pairs.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "We introduce a suite of on-the-fly rule-based negative construction strategies that distort temporal structures... curriculum learning strategy that progressively increases the difficulty of negative samples"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: L_DPO(π_θ; π_ref) = −E[ log σ( β log (π_θ(x_w | y) / π_ref(x_w | y)) − β log (π_θ(x_l | y) / π_ref(x_l | y)) ) ]
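The quoted objective is the standard DPO loss, with x_w the preferred (temporally aligned) sample, x_l the distorted negative, and y the conditioning input. A minimal numeric sketch, assuming per-sample log-probabilities under the tuned and frozen reference models are already available:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (winner margin - loser margin)).

    logp_*     : log pi_theta(x_* | y) under the model being tuned
    ref_logp_* : log pi_ref(x_* | y) under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the tuned model matches the reference exactly, both margins cancel and the loss sits at log 2; pulling probability mass toward the aligned sample and away from the distorted one drives it below that baseline.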
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.