pith. machine review for the scientific record.

arxiv: 2602.20981 · v3 · submitted 2026-02-24 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:50 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video-to-audio generation · length generalization · hierarchical networks · Mamba · multimodal alignment · long-form audio

The pith

Models trained only on short video clips can generate coherent audio exceeding five minutes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video-to-audio models can generalize from short training examples to long test videos without ever seeing extended durations during training. It introduces MMHNet, which adds a hierarchical structure and non-causal Mamba blocks to align video frames with audio over extended time spans. This removes the requirement for scarce long multimodal training pairs and mitigates distribution shift. Experiments show improved results on long-video benchmarks and successful generation beyond five minutes where earlier methods fail.

Core claim

A multimodal hierarchical network augmented with non-causal Mamba enables length generalization in video-to-audio generation, so that training exclusively on short instances produces usable audio for videos longer than five minutes at test time.

What carries the argument

MMHNet: a multimodal hierarchical network that combines hierarchical feature processing with non-causal Mamba blocks to capture long-range video-audio temporal dependencies.
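The block design is not spelled out at this level of detail, so the sketch below is only a rough illustration of the two named ingredients: a bidirectional (non-causal) gated scan over time plus a coarse, pooled pathway fused with the full-resolution one. All module names, shapes, and the simplified scan are assumptions for exposition, not the MMHNet implementation.

```python
# Illustrative sketch only: a "non-causal" (bidirectional) state-space-style block
# plus hierarchical temporal pooling, in plain PyTorch. Names and shapes are
# assumptions, not the authors' architecture.
import torch
import torch.nn as nn


class BiScanBlock(nn.Module):
    """Runs a simple gated linear recurrence forward and backward over time,
    so every position sees both past and future context (non-causal)."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)      # data-dependent decay, selective-scan flavored
        self.out_proj = nn.Linear(2 * dim, dim)

    def scan(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); h_t = a_t * h_{t-1} + (1 - a_t) * x_t
        a = torch.sigmoid(self.gate(x))
        h, out = torch.zeros_like(x[:, 0]), []
        for t in range(x.shape[1]):
            h = a[:, t] * h + (1 - a[:, t]) * x[:, t]
            out.append(h)
        return torch.stack(out, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.in_proj(x)
        fwd = self.scan(x)                  # left-to-right pass
        bwd = self.scan(x.flip(1)).flip(1)  # right-to-left pass
        return self.out_proj(torch.cat([fwd, bwd], dim=-1))


class HierarchicalFuser(nn.Module):
    """Processes video-conditioned audio features at full and pooled (coarse)
    temporal resolution, then merges them, mimicking a hierarchical design."""

    def __init__(self, dim: int, pool: int = 4):
        super().__init__()
        self.fine = BiScanBlock(dim)
        self.coarse = BiScanBlock(dim)
        self.pool = pool

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fine = self.fine(x)
        coarse = self.coarse(x[:, :: self.pool])                      # downsample in time
        coarse = coarse.repeat_interleave(self.pool, dim=1)[:, : x.shape[1]]
        return fine + coarse


if __name__ == "__main__":
    feats = torch.randn(2, 192, 64)            # e.g. 192 fused video/audio latent frames
    print(HierarchicalFuser(64)(feats).shape)  # torch.Size([2, 192, 64])
```

The point of the sketch is only that a bidirectional scan has no preferred "start" position and a coarse pathway summarizes long spans cheaply, which is the intuition behind the length-generalization claim.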

Load-bearing premise

The hierarchical structure plus non-causal Mamba can maintain alignment and coherence across long video-audio sequences even when no long examples were present in training.

What would settle it

Run the model on videos several times longer than the training clips and check whether audio-video synchronization scores or human coherence ratings collapse compared with short-video results.
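Concretely, that check could look like the following sketch, which buckets test videos by how far their duration exceeds the training clip length and compares a synchronization score across buckets. Both generate_audio and sync_score are hypothetical stand-ins (for instance a Synchformer-style alignment metric); neither name comes from the paper.

```python
# Sketch of the length-generalization check described above. `generate_audio`
# and `sync_score` are hypothetical stand-ins for the model under test and an
# off-the-shelf audio-video alignment metric; `video.duration_s` is assumed.
from statistics import mean

TRAIN_CLIP_SECONDS = 10.0  # assumed short training-clip length

def evaluate_by_duration(test_videos, generate_audio, sync_score):
    buckets = {"<=1x train length": [], "1x-5x": [], ">5x": []}
    for video in test_videos:
        audio = generate_audio(video)      # run the model on the full video
        score = sync_score(video, audio)   # higher = better alignment
        ratio = video.duration_s / TRAIN_CLIP_SECONDS
        key = "<=1x train length" if ratio <= 1 else ("1x-5x" if ratio <= 5 else ">5x")
        buckets[key].append(score)
    # Length generalization holds if the >5x bucket stays close to the short bucket
    # instead of collapsing; human coherence ratings could replace sync_score.
    return {name: mean(scores) for name, scores in buckets.items() if scores}
```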

Figures

Figures reproduced from arXiv: 2602.20981 by Akio Hayakawa, Christian Simon, Dongseok Shim, Koichi Saito, Masato Ishii, Shusuke Takahashi, Shuyang Cui, Takashi Shibuya, Wei-Yao Wang, Yuki Mitsufuji, Zhi Zhong.

Figure 1: Long-Video to Audio (LV2A) task overview (caption truncated at source).
Figure 2: Analysis of the role of positional embeddings in V2A models such as MMAudio (caption truncated at source).
Figure 3: Overview of the proposed framework. Left: a comprehensive end-to-end flow-matching model that operates across both … (caption truncated at source).
Figure 4: Visualization of audio spectrogram from MMHNet and … (caption truncated at source).
Figure 5: Comparison with past methods on various duration splits.
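Figure 3 describes the generator as an end-to-end flow-matching model. For orientation only, here is a generic conditional flow-matching training step with a straight-line (rectified-flow style) interpolation path; it is a textbook sketch under assumed tensor shapes, not the paper's actual loss or conditioning scheme.

```python
# Generic conditional flow-matching step (rectified-flow style), for orientation only.
# `model` is assumed to predict a velocity field from noisy audio latents, a timestep,
# and video conditioning; none of these names are taken from the paper.
import torch

def flow_matching_loss(model, audio_latents, video_cond):
    x1 = audio_latents                 # data sample (audio latent)
    x0 = torch.randn_like(x1)          # noise sample
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t) * x0 + t * x1         # straight-line interpolation between noise and data
    target_velocity = x1 - x0          # constant velocity along the path
    pred = model(xt, t.flatten(), video_cond)
    return torch.nn.functional.mse_loss(pred, target_velocity)
```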
read the original abstract

Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present multimodal hierarchical networks so-called MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long audio generation up to more than 5 minutes. We also prove that training short and testing long is possible in the video-to-audio generation tasks without training on the longer durations. We show in our experiments that our proposed method could achieve remarkable results on long-video to audio benchmarks, beating prior works in video-to-audio tasks. Moreover, we showcase our model capability in generating more than 5 minutes, while prior video-to-audio methods fall short in generating with long durations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MMHNet, a multimodal hierarchical extension of video-to-audio models that combines hierarchical temporal modeling with non-causal Mamba. It claims this architecture enables length generalization: models trained exclusively on short clips can generate coherent audio for videos exceeding 5 minutes at test time, without any long-duration training data. It further claims that this 'proves' training-short/testing-long is feasible and that the method outperforms prior video-to-audio approaches on long-video benchmarks.

Significance. If the length-generalization results hold with proper validation, the work would address a key scaling bottleneck in multimodal generation—limited availability of long-form aligned video-audio data—by allowing extrapolation beyond training lengths. The hierarchical + non-causal Mamba design offers a plausible route to maintaining temporal alignment and acoustic consistency over extended durations, which could have practical impact on video editing and long-form content synthesis.

major comments (2)
  1. Abstract: The central claim that 'we prove that training short and testing long is possible' and that the method 'significantly improves long audio generation up to more than 5 minutes' is presented without any quantitative metrics, baselines, ablation studies, or experimental protocol. This absence is load-bearing because the title, abstract, and contribution rest entirely on an empirical demonstration of length generalization that cannot be assessed from the given text.
  2. Abstract: The assumption that the hierarchical MMHNet structure plus non-causal Mamba prevents error accumulation and distribution shift when extrapolating far beyond training lengths (e.g., evolving scene dynamics or cumulative audio drift) is stated but not supported by any mechanism description, theoretical argument, or empirical test. This is load-bearing for the length-generalization claim.
minor comments (1)
  1. Abstract: Phrases such as 'remarkable results' and 'beating prior works' are used without naming the specific benchmarks, comparison methods, or quantitative improvements, reducing clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments regarding the abstract below, providing clarifications from the full paper and indicating planned revisions to strengthen the presentation of our claims and mechanisms.

read point-by-point responses
  1. Referee: Abstract: The central claim that 'we prove that training short and testing long is possible' and that the method 'significantly improves long audio generation up to more than 5 minutes' is presented without any quantitative metrics, baselines, ablation studies, or experimental protocol. This absence is load-bearing because the title, abstract, and contribution rest entirely on an empirical demonstration of length generalization that cannot be assessed from the given text.

    Authors: We acknowledge that the abstract lacks specific quantitative metrics and protocol details, which limits immediate assessment. The full manuscript (Sections 4 and 5) reports results on long-video-to-audio benchmarks, including comparisons against prior video-to-audio methods with metrics such as audio quality scores and temporal alignment measures, demonstrating coherent generation beyond 5 minutes without long-duration training data. We will revise the abstract to include representative quantitative improvements and a brief reference to the benchmarks and setup. revision: yes

  2. Referee: Abstract: The assumption that the hierarchical MMHNet structure plus non-causal Mamba prevents error accumulation and distribution shift when extrapolating far beyond training lengths (e.g., evolving scene dynamics or cumulative audio drift) is stated but not supported by any mechanism description, theoretical argument, or empirical test. This is load-bearing for the length-generalization claim.

    Authors: The method section details how the hierarchical temporal modeling captures multi-scale dependencies while non-causal Mamba enables bidirectional long-range context without causal accumulation of errors. We will add a concise mechanism description to the abstract and expand the paper with additional empirical tests (e.g., drift analysis over extended sequences) and a brief theoretical note on Mamba's state-space properties for extrapolation. This revision will better support the claim. revision: yes
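The 'drift analysis over extended sequences' promised above is not specified in the text provided; one simple version, sketched below with a hypothetical windowed sync_score metric and assumed clip(start, end) helpers, would measure whether audio-video alignment degrades as a function of position within a long generated clip.

```python
# Sketch of a windowed drift analysis: score alignment in fixed-length windows
# along a long generated clip and check whether quality falls off with position.
# `sync_score`, `video.clip`, and `audio.clip` are hypothetical, not from the paper.
import numpy as np

def drift_curve(video, audio, sync_score, window_s=10.0):
    scores, t = [], 0.0
    while t + window_s <= video.duration_s:
        scores.append(sync_score(video.clip(t, t + window_s),
                                 audio.clip(t, t + window_s)))
        t += window_s
    scores = np.asarray(scores)
    # A clearly negative slope over window index would indicate cumulative drift.
    slope = np.polyfit(np.arange(len(scores)), scores, deg=1)[0] if len(scores) > 1 else 0.0
    return scores, slope
```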

Circularity Check

0 steps flagged

No significant circularity: empirical result with no derivation chain or self-referential definitions

full rationale

The paper advances an empirical architecture (MMHNet: hierarchical extension of video-to-audio models using non-causal Mamba) and reports experimental outcomes on long-video benchmarks, including generation beyond 5 minutes when trained only on short clips. No equations, parameter-fitting steps, or formal derivations appear in the provided text. The central claim (training short, testing long is possible) is presented as an observed experimental outcome rather than a mathematical reduction to the model's own inputs or prior self-citations. No load-bearing uniqueness theorems, ansatzes smuggled via citation, or renamings of known results are invoked. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all technical details remain at the level of high-level architecture names.

pith-pipeline@v0.9.0 · 5517 in / 955 out tokens · 37084 ms · 2026-05-15T19:50:57.624646+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 8 internal anchors

  1. [1]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

  2. [2]

    Vggsound: A large-scale audio-visual dataset

    Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020.

  3. [3]

    Extending Context Window of Large Language Models via Positional Interpolation

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.

  4. [4]

    Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis

    Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28901–28911, 2025.

  5. [5]

    Lova: Long-form video-to-audio generation

    Xin Cheng, Xihua Wang, Yihan Wu, Yuyue Wang, and Ruihua Song. Lova: Long-form video-to-audio generation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.

  6. [6]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024.

  7. [7]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

  8. [8]

    Audio set: An ontology and human-labeled dataset for audio events

    Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, Ryan C Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780. IEEE, 2017.

  9. [9]

    Dense-localizing audio-visual events in untrimmed videos: A large-scale benchmark and baseline

    Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, and Feng Zheng. Dense-localizing audio-visual events in untrimmed videos: A large-scale benchmark and baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22942–22951, 2023.

  10. [10]

    Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos

    Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, and Feng Zheng. Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18959–18969, 2025.

  11. [11]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alexander Kirillov, Mathilde Caron, Ross Girshick, Piotr Dollár, and Ishan Misra. Imagebind: One embedding space to bind them all. arXiv preprint arXiv:2305.05665, 2023.

  12. [12]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

  13. [13]

    Lm-infinite: Zero-shot extreme length generalization for large language models

    Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Zero-shot extreme length generalization for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3991–4008, 2024.

  14. [14]

    Mambavision: A hybrid mamba-transformer vision backbone

    Ali Hatamizadeh and Jan Kautz. Mambavision: A hybrid mamba-transformer vision backbone. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25261–25270, 2025.

  15. [15]

    Zigma: A dit-style zigzag mamba diffusion model

    Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Fischer, and Bjorn Ommer. Zigma: A dit-style zigzag mamba diffusion model. arXiv, 2024.

  16. [16]

    Dynamic chunking for end-to-end hierarchical sequence modeling

    Sukjun Hwang, Brandon Wang, and Albert Gu. Dynamic chunking for end-to-end hierarchical sequence modeling. arXiv preprint arXiv:2507.07955, 2025.

  17. [17]

    Taming visually guided sound generation

    Vladimir Iashin and Esa Rahtu. Taming visually guided sound generation. arXiv preprint arXiv:2110.08791, 2021.

  18. [18]

    Synchformer: Efficient synchronization from sparse cues

    Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329. IEEE, 2024.

  19. [19]

    The impact of positional encoding on length generalization in transformers

    Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 36:24892–24928, 2023.

  20. [20]

    Panns: Large-scale pretrained audio neural networks for audio pattern recognition

    Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pages 2880–2894. IEEE, 2020.

  21. [21]

    Efficient training of audio transformers with patchout

    Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. Efficient training of audio transformers with patchout. arXiv preprint arXiv:2110.05069, 2021.

  22. [22]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

  23. [23]

    Diff-bgm: A diffusion model for video background music generation

    Sizhe Li, Yiming Qin, Minghang Zheng, Xin Jin, and Yang Liu. Diff-bgm: A diffusion model for video background music generation. In CVPR, 2024.

  24. [24]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  25. [25]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  26. [26]

    Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models

    Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36:48855–48876, 2023.

  27. [27]

    Text-to-audio generation synchronized with videos

    Shentong Mo, Jing Shi, and Yapeng Tian. Text-to-audio generation synchronized with videos. arXiv preprint arXiv:2403.07938, 2024.

  28. [28]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

  29. [29]

    YaRN: Efficient context window extension of large language models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024.

  30. [30]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

  31. [31]

    Sparse representations in audio and music: From coding to source separation

    Mark D. Plumbley, Thomas Blumensath, Laurent Daudet, Rémi Gribonval, and Mike E. Davies. Sparse representations in audio and music: From coding to source separation. Proceedings of the IEEE, 98(6):995–1005, 2010.

  32. [32]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  33. [33]

    Soundreactor: Frame-level online video-to-audio generation

    Koichi Saito, Julian Tanke, Christian Simon, Masato Ishii, Kazuki Shimada, Zachary Novack, Zhi Zhong, Akio Hayakawa, Takashi Shibuya, and Yuki Mitsufuji. Soundreactor: Frame-level online video-to-audio generation. arXiv preprint arXiv:2510.02110, 2025.

  34. [34]

    Ssamba: Self-supervised audio representation learning with mamba state space model

    Siavash Shams, Sukru Samet Dindar, Xilin Jiang, and Nima Mesgarani. Ssamba: Self-supervised audio representation learning with mamba state space model. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 1053–…

  35. [35]

    Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation

    Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation, 2025.

  36. [36]

    I hear your true colors: Image guided audio generation

    Roy Sheffer and Yossi Adi. I hear your true colors: Image guided audio generation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

  37. [37]

    Vssd: Vision mamba with non-causal state space duality

    Yuheng Shi, Minjing Dong, Mingjia Li, and Chang Xu. Vssd: Vision mamba with non-causal state space duality. arXiv preprint arXiv:2407.18559, 2024.

  38. [38]

    Titan-guide: Taming inference-time alignment for guided text-to-video diffusion models

    Christian Simon, Masato Ishii, Akio Hayakawa, Zhi Zhong, Shusuke Takahashi, Takashi Shibuya, and Yuki Mitsufuji. Titan-guide: Taming inference-time alignment for guided text-to-video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16662–16671, 2025.

  39. [39]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.

  40. [40]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

  41. [41]

    Fourier features let networks learn high frequency functions in low dimensional domains

    Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Proc. NeurIPS, pages 7537–7547. Curran Associates, Inc., 2020.

  42. [42]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  43. [43]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  44. [44]

    Temporally aligned audio for video with autoregression

    Ilpo Viertola, Vladimir Iashin, and Esa Rahtu. Temporally aligned audio for video with autoregression. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.

  45. [45]

    ReTaKe: Reducing temporal and knowledge redundancy for long video understanding

    Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, and Liqiang Nie. ReTaKe: Reducing temporal and knowledge redundancy for long video understanding. arXiv preprint arXiv:2412.20504, 2024.

  46. [46]

    Frieren: Efficient video-to-audio generation network with rectified flow matching

    Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, and Zhou Zhao. Frieren: Efficient video-to-audio generation network with rectified flow matching. Advances in Neural Information Processing Systems, 37:128118–128138, 2024.

  47. [47]

    Longmamba: Enhancing mamba's long context capabilities via training-free receptive field enlargement

    Zhifan Ye, Kejing Xia, Yonggan Fu, Xin Dong, Jihoon Hong, Xiangchi Yuan, Shizhe Diao, Jan Kautz, Pavlo Molchanov, and Yingyan Celine Lin. Longmamba: Enhancing mamba's long context capabilities via training-free receptive field enlargement. arXiv preprint arXiv:2504.16053, 2025.

  48. [48]

    Audio-synchronized visual animation

    Lin Zhang, Shentong Mo, Yijing Zhang, and Pedro Morgado. Audio-synchronized visual animation. In ECCV, 2024.

  49. [49]

    Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds

    Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, and Kai Chen. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds. arXiv preprint arXiv:2407.01494, 2024.

  50. [50]

    Long-video audio synthesis with multi-agent collaboration

    Yehang Zhang, Xinli Xu, Xiaojie Xu, Li Liu, and Yingcong Chen. Long-video audio synthesis with multi-agent collaboration. arXiv preprint arXiv:2503.10719, 2025.