Pith · machine review for the scientific record

arxiv: 2603.19857 · v2 · submitted 2026-03-20 · 💻 cs.SD · cs.CV

Recognition: 1 theorem link

· Lean Theorem

FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:31 UTC · model grok-4.3

classification 💻 cs.SD · cs.CV
keywords video-to-audio generation · temporal control · structured scripts · DiT diffusion · sound synthesis · foley audio · multi-event audio · multimodal generation

The pith

FoleyDirector adds structured temporal scripts to DiT-based video-to-audio models for exact timing control over multiple events.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

FoleyDirector targets the lack of fine timing control in video-to-audio generation, where models often fail to place sounds correctly in multi-event scenes or when visuals are unclear. It supplies short-segment captions called Structured Temporal Scripts and fuses them into the diffusion transformer process through a dedicated attention module. This keeps the original audio quality intact while letting users toggle between free generation and scripted control. The approach also adds parallel handling for sounds inside and outside the frame. The result is a practical way for users to direct sound placement as if editing a film soundtrack.

Core claim

FoleyDirector enables precise temporal guidance in DiT-based V2A generation by introducing Structured Temporal Scripts that supply captions for short temporal segments, integrated through the Script-Guided Temporal Fusion Module using Temporal Script Attention, and supported by Bi-Frame Sound Synthesis for multi-event cases, while preserving base-model audio quality and permitting seamless mode switching.
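
The paper's script schema is not reproduced on this page, but the core idea — a caption bound to each short temporal segment, optionally flagged as in-frame or off-screen — can be sketched with a minimal, hypothetical structure (field names here are illustrative assumptions, not the authors' format):

```python
from dataclasses import dataclass

@dataclass
class ScriptSegment:
    """One entry of a hypothetical Structured Temporal Script (STS):
    a caption bound to a short time window of the target audio."""
    start_s: float         # segment start, in seconds
    end_s: float           # segment end, in seconds
    caption: str           # what should sound during this window
    in_frame: bool = True  # False for off-screen / out-of-frame events

# A scripted multi-event scene: the user directs when each sound occurs.
script = [
    ScriptSegment(0.0, 1.5, "dog barking twice"),
    ScriptSegment(1.5, 3.0, "car door slams"),
    ScriptSegment(3.0, 5.0, "distant siren", in_frame=False),
]
```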

What carries the argument

Script-Guided Temporal Fusion Module, which applies Temporal Script Attention to merge features from Structured Temporal Scripts into the generation pipeline without degrading fidelity.
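
The exact Temporal Script Attention formulation is not given here (see the referee's first major comment below), but a generic shape for such a fusion — audio latent tokens cross-attending to encoded script segments, behind a zero-initialized gate so the unscripted path matches the base model — can be sketched as follows. Every name and design choice in this snippet is an assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalScriptAttentionSketch(nn.Module):
    """Illustrative cross-attention that fuses script-segment features into
    audio latent tokens. A zero-initialized gate keeps the block a no-op at
    first, so the base model's behavior is nominally preserved when no
    script is supplied."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: no-op fusion

    def forward(self, audio_tokens: torch.Tensor, script_tokens: torch.Tensor):
        # audio_tokens:  (B, T_audio, dim)    latent audio tokens in a DiT block
        # script_tokens: (B, N_segments, dim) encoded STS captions; a real
        # module would additionally restrict each audio time step to its
        # temporally aligned segments rather than attending to all of them.
        fused, _ = self.attn(self.norm(audio_tokens), script_tokens, script_tokens)
        return audio_tokens + torch.tanh(self.gate) * fused
```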

If this is right

  • Creators gain direct control over when each sound begins and ends in complex scenes.
  • Off-screen and occluded sounds become reliably placeable without relying on visible cues.
  • The same model can produce both free and temporally directed audio tracks interchangeably.
  • New evaluation sets allow direct measurement of timing accuracy in addition to audio realism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The script format could transfer to other time-aligned generation tasks such as video editing or animation sound design.
  • Extending the scripts to include intensity or spatial cues might further increase controllability beyond timing alone.
  • If the fusion approach generalizes, similar steering could apply to text-to-audio or image-to-audio pipelines.

Load-bearing premise

The fusion module can incorporate the structured scripts without creating audible artifacts or lowering quality when videos contain many overlapping events or weak visual cues.

What would settle it

Generate audio from a multi-event video with scripted timings; if the output either mismatches the requested timing sequence or shows clear quality loss relative to the uncontrolled baseline model, the central claim fails.
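
As a purely illustrative way to operationalize the timing half of that test (DirectorBench uses its own metrics, e.g. DeSync, which this does not reproduce), one could detect onsets in the generated audio and check that each scripted window contains at least one event, reusing the hypothetical ScriptSegment entries sketched earlier:

```python
import librosa

def scripted_onsets_hit(audio_path: str, script, tol_s: float = 0.25) -> float:
    """Fraction of scripted segments whose window (padded by tol_s) contains
    at least one detected onset. A crude timing-accuracy proxy, not the
    paper's evaluation protocol."""
    y, sr = librosa.load(audio_path, sr=None)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    hits = sum(
        any(seg.start_s - tol_s <= t <= seg.end_s + tol_s for t in onsets)
        for seg in script
    )
    return hits / max(len(script), 1)

# A low hit rate on scripted multi-event clips, or a clear fidelity drop
# versus the uncontrolled baseline, would count against the central claim.
```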

Figures

Figures reproduced from arXiv: 2603.19857 by Dewei Zhou, Dongliang He, Fan Ma, Fu Li, Yi Yang, You Li.

Figure 1
Figure 1: FoleyDirector enables (a) temporal control of sound events, (b) supplemental cues when visual information is insufficient, and (c) robust handling of complex multi-event sound rendering.
Figure 2
Figure 2: Overview of our method. (a) Extraction pipeline of segment-level STS features. (b) Structure of the SG-TFM module, where Temporal Script Attention introduces control signals. (c) Bi-Frame Sound Synthesis Framework, which leverages the controllability of our method in T2A and V2A to enable parallel rendering of in-frame and out-of-frame sounds.
Figure 3
Figure 3: Qualitative comparison. We visualize the spectrograms of generated audio (by prior works and our method) and the ground truth. The curves represent the recognition confidence for the corresponding sound categories, while the rectangular regions along the curves indicate the time intervals where the fine-grained scripts provide control.
Figure 4
Figure 4: Ablation on training iterations. We show the model metrics after training for different numbers of iterations. The metrics are normalized to the range of 0.3–0.8, with higher values indicating better performance.
Figure 5
Figure 5: System Prompt. The system prompt used in the annotation pipeline.
Figure 6
Figure 6: Visual Results in DirectorBench. We present several results from DirectorBench.
Figure 7
Figure 7: Different Architecture. Architectural diagrams of different design variants of SG-TFM. (a) concatenates the STS token with features from other modalities; (b) first fuses the STS token with video features using a Fuse-Attention layer; (c) combines the previous two approaches; (d) shows the design we currently adopt.
Figure 8
Figure 8: Visual Results in VGGSound-Director. We present several results from VGGSound-Director, comparing the mel-spectrograms generated by our method with those from other approaches and with the ground-truth audio. We also compute the L1 similarity between each generated mel-spectrogram and the ground truth.
Figure 9
Figure 9: Visual Results in DirectorBench. We present several results from DirectorBench.
Figure 10
Figure 10: Visual Results in VGGSound-Director. We present several results from VGGSound-Director, comparing the mel-spectrograms generated by our method with those from other approaches and with the ground-truth audio. We also compute the L1 similarity between each generated mel-spectrogram and the ground truth.
Figure 11
Figure 11: Visual Results in VGGSound-Director. We present several results from VGGSound-Director, comparing the mel-spectrograms generated by our method with those from other approaches and with the ground-truth audio. We also compute the L1 similarity between each generated mel-spectrogram and the ground truth.
read the original abstract

Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model's audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes FoleyDirector, a DiT-based video-to-audio (V2A) framework that introduces Structured Temporal Scripts (STS) as per-segment captions to enable fine-grained temporal control. These are fused via the Script-Guided Temporal Fusion Module using Temporal Script Attention, while Bi-Frame Sound Synthesis generates parallel in-frame and out-of-frame audio for multi-event cases. New resources DirectorSound, VGGSoundDirector, and DirectorBench are presented for training and evaluation. The central claim is that the method delivers precise temporal guidance while preserving base-model audio quality and permitting seamless switching between standard V2A and controlled synthesis.

Significance. If the quantitative claims hold, the work would meaningfully advance controllable V2A by addressing a recognized gap in multi-event and low-visibility scenarios. The modular STS representation and switchable architecture could enable practical user-directed Foley workflows, and the introduced benchmarks would provide a standardized testbed for future temporal-control research.

major comments (3)
  1. [§3.2] §3.2 (Script-Guided Temporal Fusion Module): the Temporal Script Attention mechanism is described only at a high level; without the explicit attention formulation or integration equations, it is impossible to verify that STS features are fused without introducing artifacts or altering the base DiT distribution in low-visibility or occluded-object cases, which is load-bearing for the 'no quality degradation' claim.
  2. [§4.2] §4.2 and Table 3 (Bi-Frame Sound Synthesis ablation): the parallel in-frame/out-of-frame generation is presented as key to multi-event controllability, yet no ablation isolating its contribution versus the fusion module alone is reported; this leaves the attribution of improved temporal metrics on DirectorBench ambiguous.
  3. [§4.3] §4.3 (quantitative results): the abstract asserts 'substantially enhances temporal controllability while maintaining high audio fidelity,' but the visible results lack direct side-by-side comparison of perceptual quality metrics (e.g., FAD, CLAP) against the unmodified DiT baseline on the same multi-event subsets, undermining the preservation claim.
minor comments (2)
  1. [Abstract] The term 'Foley directors' is introduced in the abstract without a brief operational definition; a short clarification in §1 would improve accessibility.
  2. [Figure 2] Figure captions for the architecture diagram should explicitly label the data flow between STS, the fusion module, and the Bi-Frame synthesizer to match the textual description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper to incorporate the suggested clarifications and additional analyses.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Script-Guided Temporal Fusion Module): the Temporal Script Attention mechanism is described only at a high level; without the explicit attention formulation or integration equations, it is impossible to verify that STS features are fused without introducing artifacts or altering the base DiT distribution in low-visibility or occluded-object cases, which is load-bearing for the 'no quality degradation' claim.

    Authors: We agree that the description of the Temporal Script Attention in §3.2 is high-level. In the revised manuscript we will add the explicit mathematical formulation, including the query-key-value projections, attention computation, and the precise integration equations showing how STS features are fused into the DiT blocks. These additions will demonstrate that the fusion is designed to preserve the base DiT distribution and does not introduce artifacts, directly supporting the no-degradation claim even in low-visibility and occluded cases. revision: yes

  2. Referee: [§4.2] §4.2 and Table 3 (Bi-Frame Sound Synthesis ablation): the parallel in-frame/out-of-frame generation is presented as key to multi-event controllability, yet no ablation isolating its contribution versus the fusion module alone is reported; this leaves the attribution of improved temporal metrics on DirectorBench ambiguous.

    Authors: The referee is correct that Table 3 does not isolate the Bi-Frame Sound Synthesis contribution from the fusion module. We will add a dedicated ablation in the revised version that fixes the Script-Guided Temporal Fusion Module and compares variants with and without Bi-Frame Sound Synthesis. This will clarify the specific contribution of the parallel in-frame/out-of-frame generation to the temporal metrics on DirectorBench. revision: yes

  3. Referee: [§4.3] §4.3 (quantitative results): the abstract asserts 'substantially enhances temporal controllability while maintaining high audio fidelity,' but the visible results lack direct side-by-side comparison of perceptual quality metrics (e.g., FAD, CLAP) against the unmodified DiT baseline on the same multi-event subsets, undermining the preservation claim.

    Authors: We acknowledge that the current quantitative results do not include direct side-by-side perceptual quality comparisons (FAD, CLAP) of the full model versus the unmodified DiT baseline specifically on multi-event subsets. In the revision we will add these comparisons on the relevant DirectorBench multi-event subsets to provide direct evidence that audio fidelity is preserved. revision: yes

Circularity Check

0 steps flagged

No significant circularity; architectural additions are independent of inputs

full rationale

The paper introduces FoleyDirector as a modular extension to existing DiT-based V2A models via new components (Structured Temporal Scripts, Script-Guided Temporal Fusion Module with Temporal Script Attention, and Bi-Frame Sound Synthesis) plus supporting datasets (DirectorSound, VGGSoundDirector, DirectorBench). No equations, fitted parameters, or self-citations are presented that reduce any claimed result to its own inputs by construction. The temporal control claims rest on the explicit design of these modules rather than any re-derivation or renaming of prior quantities, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework appears to rest on standard DiT diffusion assumptions plus the new script representation.

pith-pipeline@v0.9.0 · 5538 in / 1017 out tokens · 32188 ms · 2026-05-15T07:31:39.630696+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 1 Pith paper · 7 internal anchors
