pith. machine review for the scientific record.

arxiv: 2602.20981 · v3 · submitted 2026-02-24 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:50 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video-to-audio generation · length generalization · hierarchical networks · Mamba · multimodal alignment · long-form audio

The pith

Models trained only on short video clips can generate coherent audio exceeding five minutes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video-to-audio models can generalize from short training examples to long test videos without ever seeing extended durations during training. It introduces MMHNet, which adds a hierarchical structure and non-causal Mamba blocks to align video frames with audio over extended time spans. This removes the requirement for scarce long multimodal training pairs and mitigates distribution shift. Experiments show improved results on long-video benchmarks and successful generation beyond five minutes where earlier methods fail.

Core claim

A multimodal hierarchical network augmented with non-causal Mamba enables length generalization in video-to-audio generation, so that training exclusively on short instances produces usable audio for videos longer than five minutes at test time.

What carries the argument

MMHNet: a multimodal hierarchical network that combines hierarchical feature processing with non-causal Mamba blocks to capture long-range video-audio temporal dependencies.
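The block design is not spelled out at this level of detail, so the sketch below is only a rough illustration of the two named ingredients: a bidirectional (non-causal) gated scan over time plus a coarse, pooled pathway fused with the full-resolution one. All module names, shapes, and the simplified scan are assumptions for exposition, not the MMHNet implementation.

```python
# Illustrative sketch only: a "non-causal" (bidirectional) state-space-style block
# plus hierarchical temporal pooling, in plain PyTorch. Names and shapes are
# assumptions, not the authors' architecture.
import torch
import torch.nn as nn


class BiScanBlock(nn.Module):
    """Runs a simple gated linear recurrence forward and backward over time,
    so every position sees both past and future context (non-causal)."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)      # data-dependent decay, selective-scan flavored
        self.out_proj = nn.Linear(2 * dim, dim)

    def scan(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); h_t = a_t * h_{t-1} + (1 - a_t) * x_t
        a = torch.sigmoid(self.gate(x))
        h, out = torch.zeros_like(x[:, 0]), []
        for t in range(x.shape[1]):
            h = a[:, t] * h + (1 - a[:, t]) * x[:, t]
            out.append(h)
        return torch.stack(out, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.in_proj(x)
        fwd = self.scan(x)                  # left-to-right pass
        bwd = self.scan(x.flip(1)).flip(1)  # right-to-left pass
        return self.out_proj(torch.cat([fwd, bwd], dim=-1))


class HierarchicalFuser(nn.Module):
    """Processes video-conditioned audio features at full and pooled (coarse)
    temporal resolution, then merges them, mimicking a hierarchical design."""

    def __init__(self, dim: int, pool: int = 4):
        super().__init__()
        self.fine = BiScanBlock(dim)
        self.coarse = BiScanBlock(dim)
        self.pool = pool

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fine = self.fine(x)
        coarse = self.coarse(x[:, :: self.pool])                      # downsample in time
        coarse = coarse.repeat_interleave(self.pool, dim=1)[:, : x.shape[1]]
        return fine + coarse


if __name__ == "__main__":
    feats = torch.randn(2, 192, 64)            # e.g. 192 fused video/audio latent frames
    print(HierarchicalFuser(64)(feats).shape)  # torch.Size([2, 192, 64])
```

The point of the sketch is only that a bidirectional scan has no preferred "start" position and a coarse pathway summarizes long spans cheaply, which is the intuition behind the length-generalization claim.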

Load-bearing premise

The hierarchical structure plus non-causal Mamba can maintain alignment and coherence across long video-audio sequences even when no long examples were present in training.

What would settle it

Run the model on videos several times longer than the training clips and check whether audio-video synchronization scores or human coherence ratings collapse compared with short-video results.
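Concretely, that check could look like the following sketch, which buckets test videos by how far their duration exceeds the training clip length and compares a synchronization score across buckets. Both generate_audio and sync_score are hypothetical stand-ins (for instance a Synchformer-style alignment metric); neither name comes from the paper.

```python
# Sketch of the length-generalization check described above. `generate_audio`
# and `sync_score` are hypothetical stand-ins for the model under test and an
# off-the-shelf audio-video alignment metric; `video.duration_s` is assumed.
from statistics import mean

TRAIN_CLIP_SECONDS = 10.0  # assumed short training-clip length

def evaluate_by_duration(test_videos, generate_audio, sync_score):
    buckets = {"<=1x train length": [], "1x-5x": [], ">5x": []}
    for video in test_videos:
        audio = generate_audio(video)      # run the model on the full video
        score = sync_score(video, audio)   # higher = better alignment
        ratio = video.duration_s / TRAIN_CLIP_SECONDS
        key = "<=1x train length" if ratio <= 1 else ("1x-5x" if ratio <= 5 else ">5x")
        buckets[key].append(score)
    # Length generalization holds if the >5x bucket stays close to the short bucket
    # instead of collapsing; human coherence ratings could replace sync_score.
    return {name: mean(scores) for name, scores in buckets.items() if scores}
```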

Figures

Figures reproduced from arXiv: 2602.20981 by Akio Hayakawa, Christian Simon, Dongseok Shim, Koichi Saito, Masato Ishii, Shusuke Takahashi, Shuyang Cui, Takashi Shibuya, Wei-Yao Wang, Yuki Mitsufuji, Zhi Zhong.

Figure 1: Long-Video to Audio (LV2A) task overview (caption truncated at source).
Figure 2: Analysis of the role of positional embeddings in V2A models such as MMAudio (caption truncated at source).
Figure 3: Overview of the proposed framework. Left: a comprehensive end-to-end flow-matching model that operates across both … (caption truncated at source).
Figure 4: Visualization of audio spectrogram from MMHNet and … (caption truncated at source).
Figure 5: Comparison with past methods on various duration splits.
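Figure 3 describes the generator as an end-to-end flow-matching model. For orientation only, here is a generic conditional flow-matching training step with a straight-line (rectified-flow style) interpolation path; it is a textbook sketch under assumed tensor shapes, not the paper's actual loss or conditioning scheme.

```python
# Generic conditional flow-matching step (rectified-flow style), for orientation only.
# `model` is assumed to predict a velocity field from noisy audio latents, a timestep,
# and video conditioning; none of these names are taken from the paper.
import torch

def flow_matching_loss(model, audio_latents, video_cond):
    x1 = audio_latents                 # data sample (audio latent)
    x0 = torch.randn_like(x1)          # noise sample
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t) * x0 + t * x1         # straight-line interpolation between noise and data
    target_velocity = x1 - x0          # constant velocity along the path
    pred = model(xt, t.flatten(), video_cond)
    return torch.nn.functional.mse_loss(pred, target_velocity)
```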
read the original abstract

Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present multimodal hierarchical networks so-called MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long audio generation up to more than 5 minutes. We also prove that training short and testing long is possible in the video-to-audio generation tasks without training on the longer durations. We show in our experiments that our proposed method could achieve remarkable results on long-video to audio benchmarks, beating prior works in video-to-audio tasks. Moreover, we showcase our model capability in generating more than 5 minutes, while prior video-to-audio methods fall short in generating with long durations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MMHNet, a multimodal hierarchical extension of video-to-audio models that combines hierarchical temporal modeling with non-causal Mamba. It claims this architecture enables length generalization: models trained exclusively on short clips can generate coherent audio for videos exceeding 5 minutes at test time, without any long-duration training data. It further claims that this 'proves' training-short/testing-long is feasible and that the method outperforms prior video-to-audio approaches on long-video benchmarks.

Significance. If the length-generalization results hold with proper validation, the work would address a key scaling bottleneck in multimodal generation—limited availability of long-form aligned video-audio data—by allowing extrapolation beyond training lengths. The hierarchical + non-causal Mamba design offers a plausible route to maintaining temporal alignment and acoustic consistency over extended durations, which could have practical impact on video editing and long-form content synthesis.

major comments (2)
  1. Abstract: The central claim that 'we prove that training short and testing long is possible' and that the method 'significantly improves long audio generation up to more than 5 minutes' is presented without any quantitative metrics, baselines, ablation studies, or experimental protocol. This absence is load-bearing because the title, abstract, and contribution rest entirely on an empirical demonstration of length generalization that cannot be assessed from the given text.
  2. Abstract: The assumption that the hierarchical MMHNet structure plus non-causal Mamba prevents error accumulation and distribution shift when extrapolating far beyond training lengths (e.g., evolving scene dynamics or cumulative audio drift) is stated but not supported by any mechanism description, theoretical argument, or empirical test. This is load-bearing for the length-generalization claim.
minor comments (1)
  1. Abstract: Phrases such as 'remarkable results' and 'beating prior works' are used without naming the specific benchmarks, comparison methods, or quantitative improvements, reducing clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments regarding the abstract below, providing clarifications from the full paper and indicating planned revisions to strengthen the presentation of our claims and mechanisms.

read point-by-point responses
  1. Referee: Abstract: The central claim that 'we prove that training short and testing long is possible' and that the method 'significantly improves long audio generation up to more than 5 minutes' is presented without any quantitative metrics, baselines, ablation studies, or experimental protocol. This absence is load-bearing because the title, abstract, and contribution rest entirely on an empirical demonstration of length generalization that cannot be assessed from the given text.

    Authors: We acknowledge that the abstract lacks specific quantitative metrics and protocol details, which limits immediate assessment. The full manuscript (Sections 4 and 5) reports results on long-video-to-audio benchmarks, including comparisons against prior video-to-audio methods with metrics such as audio quality scores and temporal alignment measures, demonstrating coherent generation beyond 5 minutes without long-duration training data. We will revise the abstract to include representative quantitative improvements and a brief reference to the benchmarks and setup. revision: yes

  2. Referee: Abstract: The assumption that the hierarchical MMHNet structure plus non-causal Mamba prevents error accumulation and distribution shift when extrapolating far beyond training lengths (e.g., evolving scene dynamics or cumulative audio drift) is stated but not supported by any mechanism description, theoretical argument, or empirical test. This is load-bearing for the length-generalization claim.

    Authors: The method section details how the hierarchical temporal modeling captures multi-scale dependencies while non-causal Mamba enables bidirectional long-range context without causal accumulation of errors. We will add a concise mechanism description to the abstract and expand the paper with additional empirical tests (e.g., drift analysis over extended sequences) and a brief theoretical note on Mamba's state-space properties for extrapolation. This revision will better support the claim. revision: yes
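The 'drift analysis over extended sequences' promised above is not specified in the text provided; one simple version, sketched below with a hypothetical windowed sync_score metric and assumed clip(start, end) helpers, would measure whether audio-video alignment degrades as a function of position within a long generated clip.

```python
# Sketch of a windowed drift analysis: score alignment in fixed-length windows
# along a long generated clip and check whether quality falls off with position.
# `sync_score`, `video.clip`, and `audio.clip` are hypothetical, not from the paper.
import numpy as np

def drift_curve(video, audio, sync_score, window_s=10.0):
    scores, t = [], 0.0
    while t + window_s <= video.duration_s:
        scores.append(sync_score(video.clip(t, t + window_s),
                                 audio.clip(t, t + window_s)))
        t += window_s
    scores = np.asarray(scores)
    # A clearly negative slope over window index would indicate cumulative drift.
    slope = np.polyfit(np.arange(len(scores)), scores, deg=1)[0] if len(scores) > 1 else 0.0
    return scores, slope
```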

Circularity Check

0 steps flagged

No significant circularity: empirical result with no derivation chain or self-referential definitions

full rationale

The paper advances an empirical architecture (MMHNet: hierarchical extension of video-to-audio models using non-causal Mamba) and reports experimental outcomes on long-video benchmarks, including generation beyond 5 minutes when trained only on short clips. No equations, parameter-fitting steps, or formal derivations appear in the provided text. The central claim (training short, testing long is possible) is presented as an observed experimental outcome rather than a mathematical reduction to the model's own inputs or prior self-citations. No load-bearing uniqueness theorems, ansatzes smuggled via citation, or renamings of known results are invoked. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all technical details remain at the level of high-level architecture names.

pith-pipeline@v0.9.0 · 5517 in / 955 out tokens · 37084 ms · 2026-05-15T19:50:57.624646+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 8 internal anchors

  1. [1]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

  2. [2]

    Vggsound: A large-scale audio-visual dataset

    Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020.

  3. [3]

    Extending Context Window of Large Language Models via Positional Interpolation

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.

  4. [4]

    Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis

    Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28901–28911, 2025.

  5. [5]

    Lova: Long-form video-to-audio generation

    Xin Cheng, Xihua Wang, Yihan Wu, Yuyue Wang, and Ruihua Song. Lova: Long-form video-to-audio generation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.

  6. [6]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024.

  7. [7]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

  8. [8]

    Audio set: An ontology and human-labeled dataset for audio events

    Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, Ryan C Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780. IEEE, 2017.

  9. [9]

    Dense-localizing audio-visual events in untrimmed videos: A large-scale benchmark and baseline

    Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, and Feng Zheng. Dense-localizing audio-visual events in untrimmed videos: A large-scale benchmark and baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22942–22951, 2023.

  10. [10]

    Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos

    Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, and Feng Zheng. Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18959–18969, 2025.

  11. [11]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alexander Kirillov, Mathilde Caron, Ross Girshick, Piotr Dollár, and Ishan Misra. Imagebind: One embedding space to bind them all. arXiv preprint arXiv:2305.05665, 2023.

  12. [12]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

  13. [13]

    Lm-infinite: Zero-shot extreme length generalization for large language models

    Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Zero-shot extreme length generalization for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3991–4008, 2024.

  14. [14]

    Mambavision: A hybrid mamba-transformer vision backbone

    Ali Hatamizadeh and Jan Kautz. Mambavision: A hybrid mamba-transformer vision backbone. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25261–25270, 2025.

  15. [15]

    Zigma: A dit-style zigzag mamba diffusion model

    Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Fischer, and Bjorn Ommer. Zigma: A dit-style zigzag mamba diffusion model. arXiv, 2024.

  16. [16]

    Dynamic chunking for end-to-end hierarchical sequence modeling

    Sukjun Hwang, Brandon Wang, and Albert Gu. Dynamic chunking for end-to-end hierarchical sequence modeling. arXiv preprint arXiv:2507.07955, 2025.

  17. [17]

    Taming visually guided sound generation

    Vladimir Iashin and Esa Rahtu. Taming visually guided sound generation. arXiv preprint arXiv:2110.08791, 2021.

  18. [18]

    Synchformer: Efficient synchronization from sparse cues

    Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329. IEEE, 2024.

  19. [19]

    The impact of positional encoding on length generalization in transformers

    Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 36:24892–24928, 2023.

  20. [20]

    Panns: Large-scale pretrained audio neural networks for audio pattern recognition

    Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pages 2880–2894. IEEE, 2020.

  21. [21]

    Efficient training of audio transformers with patchout

    Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. Efficient training of audio transformers with patchout. arXiv preprint arXiv:2110.05069, 2021.

  22. [22]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

  23. [23]

    Diff-bgm: A diffusion model for video background music generation

    Sizhe Li, Yiming Qin, Minghang Zheng, Xin Jin, and Yang Liu. Diff-bgm: A diffusion model for video background music generation. In CVPR, 2024.

  24. [24]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  25. [25]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  26. [26]

    Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models

    Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36:48855–48876, 2023.

  27. [27]

    Text-to-audio generation synchronized with videos

    Shentong Mo, Jing Shi, and Yapeng Tian. Text-to-audio generation synchronized with videos. arXiv preprint arXiv:2403.07938, 2024.

  28. [28]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

  29. [29]

    YaRN: Efficient context window extension of large language models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024.

  30. [30]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

  31. [31]

    Sparse representations in audio and music: From coding to source separation

    Mark D. Plumbley, Thomas Blumensath, Laurent Daudet, Rémi Gribonval, and Mike E. Davies. Sparse representations in audio and music: From coding to source separation. Proceedings of the IEEE, 98(6):995–1005, 2010.

  32. [32]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  33. [33]

    Soundreactor: Frame-level online video-to-audio generation

    Koichi Saito, Julian Tanke, Christian Simon, Masato Ishii, Kazuki Shimada, Zachary Novack, Zhi Zhong, Akio Hayakawa, Takashi Shibuya, and Yuki Mitsufuji. Soundreactor: Frame-level online video-to-audio generation. arXiv preprint arXiv:2510.02110, 2025.

  34. [34]

    Ssamba: Self-supervised audio representation learning with mamba state space model

    Siavash Shams, Sukru Samet Dindar, Xilin Jiang, and Nima Mesgarani. Ssamba: Self-supervised audio representation learning with mamba state space model. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 1053–…

  35. [35]

    Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation

    Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation, 2025.

  36. [36]

    I hear your true colors: Image guided audio generation

    Roy Sheffer and Yossi Adi. I hear your true colors: Image guided audio generation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

  37. [37]

    Vssd: Vision mamba with non-causal state space duality

    Yuheng Shi, Minjing Dong, Mingjia Li, and Chang Xu. Vssd: Vision mamba with non-causal state space duality. arXiv preprint arXiv:2407.18559, 2024.

  38. [38]

    Titan-guide: Taming inference-time alignment for guided text-to-video diffusion models

    Christian Simon, Masato Ishii, Akio Hayakawa, Zhi Zhong, Shusuke Takahashi, Takashi Shibuya, and Yuki Mitsufuji. Titan-guide: Taming inference-time alignment for guided text-to-video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16662–16671, 2025.

  39. [39]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.

  40. [40]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

  41. [41]

    Fourier features let networks learn high frequency functions in low dimensional domains

    Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Proc. NeurIPS, pages 7537–7547. Curran Associates, Inc., 2020.

  42. [42]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  43. [43]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  44. [44]

    Temporally aligned audio for video with autoregression

    Ilpo Viertola, Vladimir Iashin, and Esa Rahtu. Temporally aligned audio for video with autoregression. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.

  45. [45]

    ReTaKe: Reducing temporal and knowledge redundancy for long video understanding

    Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, and Liqiang Nie. ReTaKe: Reducing temporal and knowledge redundancy for long video understanding. arXiv preprint arXiv:2412.20504, 2024.

  46. [46]

    Frieren: Efficient video-to-audio generation network with rectified flow matching

    Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, and Zhou Zhao. Frieren: Efficient video-to-audio generation network with rectified flow matching. Advances in Neural Information Processing Systems, 37:128118–128138, 2024.

  47. [47]

    Longmamba: Enhancing mamba's long context capabilities via training-free receptive field enlargement

    Zhifan Ye, Kejing Xia, Yonggan Fu, Xin Dong, Jihoon Hong, Xiangchi Yuan, Shizhe Diao, Jan Kautz, Pavlo Molchanov, and Yingyan Celine Lin. Longmamba: Enhancing mamba's long context capabilities via training-free receptive field enlargement. arXiv preprint arXiv:2504.16053, 2025.

  48. [48]

    Audio-synchronized visual animation

    Lin Zhang, Shentong Mo, Yijing Zhang, and Pedro Morgado. Audio-synchronized visual animation. In ECCV, 2024.

  49. [49]

    Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds

    Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, and Kai Chen. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds. arXiv preprint arXiv:2407.01494, 2024.

  50. [50]

    Long-video audio synthesis with multi-agent collaboration

    Yehang Zhang, Xinli Xu, Xiaojie Xu, Li Liu, and Yingcong Chen. Long-video audio synthesis with multi-agent collaboration. arXiv preprint arXiv:2503.10719, 2025.